Extract news article content

Chris_W

Member
Hi,

I am wondering if there is a way to extract post content from news article URLs using the API by Pabbly action?

My workflow is as follows:

>Import random URLs from RSS feed using API by Pabbly
>Retrieve <!DOCTYPE html> from one of the URLs found in step 1 using API by Pabbly
>Text parser to extract the post content AFTER: "articleBody":" BEFORE: "
Error message: Field(s) text, after, before character length exceeded the allowed max value

The added complexity to this workflow is the random URLs imported in step 1 are always changing, so the html structure is always changing and will not always fall between the text mentioned in step 2.
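
One way to avoid fixed AFTER/BEFORE delimiters is to read the structured data directly. The sketch below assumes the page embeds schema.org JSON-LD in a `<script type="application/ld+json">` tag with an `articleBody` field, which the `"articleBody":"` marker above suggests for at least some sites; pages without it return `None`. The names `JsonLdCollector`, `find_article_body`, and `extract_article_body` are hypothetical, and whether Pabbly's code step permits these standard-library modules is not confirmed here.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_article_body(node):
    """Recursively search decoded JSON for the first non-empty "articleBody" string."""
    if isinstance(node, dict):
        body = node.get("articleBody")
        if isinstance(body, str) and body.strip():
            return body
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_article_body(child)
        if found:
            return found
    return None


def extract_article_body(html):
    """Return the article text from the page's JSON-LD, or None if absent."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks and try the next one
        body = find_article_body(data)
        if body:
            return body
    return None
```

Because the lookup is by key rather than by position in the text, it does not depend on where the field sits in each site's HTML, and it sidesteps the text parser's character limit by running in code instead.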

Can you suggest a solution or a better way of achieving my outcome?

Thanks
 

Supreme

Well-known member
Staff member
The added complexity to this workflow is the random URLs imported in step 1 are always changing, so the html structure is always changing and will not always fall between the text mentioned in step 2.
Can you please elaborate on this condition which needs to be parsed to get the URL?
 

Chris_W

Member
I am using an RSS feed that shows news article links from multiple news websites (e.g. Fox News, vice.com, nbcnews), and I can successfully obtain the URL from the RSS feed to be used in the next step.

In the next step, if I run the URL in Pabbly's API I receive a DOCTYPE html response as below

1690895618057.png


From here, I'm not sure how to extract the post content because 1) the DOCTYPE is too long and 2) the urls are always changing so the DOCTYPE structure is never the same

1690895746720.png


Hope that helps!
 

Chris_W

Member
Hey, just a quick update: I have tried to use Pabbly's Python code action, but I am receiving the error "invalid syntax (<string>, line 3)" on the script below. Do you know what is going wrong?

pip install requests
pip install beautifulsoup4

import requests
from bs4 import BeautifulSoup

def scrape_article_text(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Check if the request was successful
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find all <p> tags and extract their text content
        paragraphs = soup.find_all('p')

        # Concatenate the text content of all <p> tags
        article_text = "\n".join([p.get_text() for p in paragraphs])
        return article_text

    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL: {e}")
        return None
    except Exception as ex:
        print(f"Error parsing the article: {ex}")
        return None

# Example usage:
url = 'result">3. Result : https://www.nbcnews.com/news/world/sun-bears-zoo-china-denies-humans-costume-rcna97477'
article_text = scrape_article_text(url)

if article_text:
    print(article_text)
else:
    print("Failed to scrape the article.")
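
A likely cause of the syntax error is that the `pip install` lines are shell commands, not Python statements, so the interpreter rejects them. The `url` value also carries a stray `result">3. Result : ` prefix ahead of the actual link. The sketch below is one possible rework that uses only the standard library, since it is not confirmed here whether Pabbly's code action provides `requests` or `beautifulsoup4`. The helper names `clean_url`, `ParagraphCollector`, and `paragraphs_from_html` are hypothetical, and the `User-Agent` header is an assumption about what some sites expect.

```python
import re
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.request import Request, urlopen


class ParagraphCollector(HTMLParser):
    """Collects the text inside each <p> element."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one begins
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []


def clean_url(raw):
    """Return the first http(s) URL found in raw, or None."""
    match = re.search(r"https?://\S+", raw)
    return match.group(0) if match else None


def paragraphs_from_html(html):
    """Concatenate the text of all <p> tags, one paragraph per line."""
    collector = ParagraphCollector()
    collector.feed(html)
    return "\n".join(collector.paragraphs)


def scrape_article_text(raw_url, timeout=10):
    """Fetch the page and return its paragraph text, or None on failure."""
    url = clean_url(raw_url)
    if url is None:
        return None
    try:
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
    except (URLError, ValueError) as exc:
        print(f"Error fetching the URL: {exc}")
        return None
    return paragraphs_from_html(html)
```

Calling `scrape_article_text(url)` with the mapped value from the previous step would then strip the prefix before fetching. Collecting every `<p>` tag may also pick up navigation or footer text, so the result can need further filtering per site.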
 

Supreme

Well-known member
Staff member
Hey @Chris_W

Given the complexity of your use case, which requires custom code to parse the page content, I would highly recommend enlisting the services of an automation expert. Such a professional can design the required automation, including the JS/Python code for your workflow, and take care of all management aspects on your behalf.
 