
Extract news article content

Chris_W

Member
Hi,

I am wondering if there is a way to extract post content from news article URLs using the API by Pabbly action?

My workflow is as follows:

1. Import random URLs from an RSS feed using API by Pabbly.
2. Retrieve the page HTML (<!DOCTYPE html> ...) from one of the URLs found in step 1 using API by Pabbly.
3. Use the text parser to extract the post content AFTER: &quot;articleBody&quot;:&quot; BEFORE: &quot;
   Error message: Field(s) text, after, before character length exceeded the allowed max value

The added complexity in this workflow is that the random URLs imported in step 1 are always changing, so the HTML structure is always changing and the content will not always fall between the delimiters used in step 3.

Can you suggest a solution or a better way of achieving my outcome?

Thanks
 

Pabblymember11

Guest
The added complexity in this workflow is that the random URLs imported in step 1 are always changing, so the HTML structure is always changing and the content will not always fall between the delimiters used in step 3.
Could you please elaborate on this condition and on what needs to be parsed?
 

Chris_W

Member
I am using an RSS feed that lists news article links from multiple news websites (e.g. Fox News, vice.com, nbcnews), and I can successfully obtain the URL from the RSS feed for use in the next step.

In the next step, if I run the URL through API by Pabbly, I receive a DOCTYPE html response as shown below:

[Attached image: 1690895618057.png]


From here, I'm not sure how to extract the post content because 1) the DOCTYPE html response is too long and 2) the URLs are always changing, so the HTML structure is never the same.

[Attached image: 1690895746720.png]


Hope that helps!
 

Chris_W

Member
Hey, just a quick update: I have tried to use Pabbly's Python code action, but I am receiving the error "invalid syntax (<string>, line 3)" on the script below. Do you know what is going wrong?

pip install requests
pip install beautifulsoup4

import requests
from bs4 import BeautifulSoup

def scrape_article_text(url):
    try:
        response = requests.get(url)
        response.raise_for_status() # Check if the request was successful
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find all <p> tags and extract their text content
        paragraphs = soup.find_all('p')

        # Concatenate the text content of all <p> tags
        article_text = "\n".join([p.get_text() for p in paragraphs])
        return article_text

    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL: {e}")
        return None
    except Exception as ex:
        print(f"Error parsing the article: {ex}")
        return None

# Example usage:
url = 'result">3. Result : https://www.nbcnews.com/news/world/sun-bears-zoo-china-denies-humans-costume-rcna97477'
article_text = scrape_article_text(url)

if article_text:
    print(article_text)
else:
    print("Failed to scrape the article.")
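
A likely cause of the syntax error reported for this script: `pip install ...` is a shell command, not Python, so the interpreter rejects those lines before anything runs. The pasted `url` value also carries mapping residue (`result">3. Result : `) in front of the actual link, which would make the request fail even once the syntax is fixed. The sketch below shows the same idea (collect the text of every `<p>` tag) using only the Python standard library, on the assumption that third-party packages such as `requests` and `beautifulsoup4` may not be available inside the code action. `ParagraphCollector`, `extract_paragraph_text`, and `fetch_html` are illustrative names, not Pabbly APIs.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraph strings
        self._in_p = False     # True while inside a <p>
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []


def extract_paragraph_text(html):
    """Return the text of all <p> tags in `html`, one paragraph per line."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()            # keep a trailing paragraph that was never closed
    return "\n".join(parser.paragraphs)


def fetch_html(url, timeout=10):
    """Download a page as text (assumes outbound HTTP requests are allowed)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `extract_paragraph_text(fetch_html(url))` would then take a clean URL (only the `https://...` part) mapped from the earlier step. Collecting every `<p>` tag also picks up captions and footer text on many sites, so the result may need trimming.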
 

Pabblymember11

Guest
Hey @Chris_W,

Considering the complexity of your use case, which involves parsing page code, I would highly advise enlisting the services of an automation expert. Such a professional can design the required automation, along with the JS/Python code for your workflow, and take care of all management aspects on your behalf.
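
For anyone with the same problem, one approach that does not depend on each site's page layout: the `articleBody` marker used in the text parser step is a schema.org property, and many news sites publish it inside a `<script type="application/ld+json">` metadata block. The sketch below is a minimal, standard-library-only illustration of reading that block. It assumes the page actually embeds JSON-LD with an `articleBody` field, which not every site does, and the names `JsonLdCollector`, `find_article_body`, and `extract_article_body` are hypothetical helpers, not part of any Pabbly API.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []          # raw JSON strings, one per metadata block
        self._in_jsonld = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_attr = (dict(attrs).get("type") or "").strip().lower()
            self._in_jsonld = type_attr == "application/ld+json"
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._chunks))
            self._chunks = []
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)


def find_article_body(node):
    """Search nested dicts/lists for the first non-empty "articleBody" string."""
    if isinstance(node, dict):
        body = node.get("articleBody")
        if isinstance(body, str) and body.strip():
            return body.strip()
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_article_body(child)
        if found:
            return found
    return None


def extract_article_body(html):
    """Return the article text from JSON-LD metadata, or None if absent."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue              # skip malformed metadata blocks
        body = find_article_body(data)
        if body:
            return body
    return None
```

Because the function returns `None` when no usable metadata is found, a workflow can fall back to another method (for example, collecting `<p>` text) for sites that do not publish `articleBody`.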
 