Extract news article content

Chris_W · Aug 1, 2023

Hi,

Is there a way to extract the post content from news article URLs using the API by Pabbly action?

My workflow is as follows:

>Import random URLs from RSS feed using API by Pabbly
>Retrieve <!DOCTYPE html> from one of the URLs found in step 1 using API by Pabbly
>Text parser to extract the post content AFTER: "articleBody":" BEFORE: "
Error message: Field(s) text, after, before character length exceeded the allowed max value

The added complexity is that the URLs imported in step 1 are always changing, so the HTML structure changes too, and the post content will not always fall between the markers used in step 3.

Can you suggest a solution or a better way of achieving my outcome?

Thanks
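
One possible approach, sketched here under assumptions: the "articleBody" key used as a marker in step 3 normally comes from a schema.org JSON-LD block embedded in the page. If the fetched page contains such a block, parsing it as JSON avoids fixed AFTER/BEFORE markers. The names JsonLdCollector, find_article_body and extract_article_body are hypothetical, only the Python standard library is used, and not every news site is guaranteed to publish an articleBody field.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_article_body(node):
    """Recursively search parsed JSON for the first non-empty "articleBody" string."""
    if isinstance(node, dict):
        body = node.get("articleBody")
        if isinstance(body, str) and body.strip():
            return body
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_article_body(child)
        if found:
            return found
    return None


def extract_article_body(html):
    """Return the articleBody found in the page's JSON-LD, or None if there is none."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        body = find_article_body(data)
        if body:
            return body
    return None
```

Here extract_article_body takes the HTML string returned by step 2 and gives back None when no articleBody is present, so the workflow can fall back to another method for sites that do not publish one.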

Supreme · Aug 1, 2023

Chris_W said:
The added complexity is that the URLs imported in step 1 are always changing, so the HTML structure changes too, and the post content will not always fall between the markers used in step 3.

Can you please elaborate on this condition which needs to be parsed to get the URL?

Chris_W · Aug 1, 2023

I am using an RSS feed that shows news article links from multiple news websites (e.g. Fox News, vice.com, nbcnews), and I can successfully obtain the URL from the RSS feed to be used in the next step.

In the next step, if I run the URL in Pabbly's API I receive a DOCTYPE html response as below

From here, I'm not sure how to extract the post content because 1) the HTML response is too long for the text parser and 2) the URLs are always changing, so the HTML structure is never the same.

Hope that helps!

Chris_W · Aug 2, 2023

Hey, just a quick update: I have tried to use Pabbly's Python code action, but I am receiving the error "invalid syntax (<string>, line 3)" on the script below. Do you know what is going wrong?

# Dependencies (shell commands, to be run outside the script, not as Python):
#   pip install requests
#   pip install beautifulsoup4

import requests
from bs4 import BeautifulSoup

def scrape_article_text(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an error if the request was not successful
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find all <p> tags and extract their text content
        paragraphs = soup.find_all('p')

        # Concatenate the text content of all <p> tags
        article_text = "\n".join([p.get_text() for p in paragraphs])
        return article_text

    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL: {e}")
        return None
    except Exception as ex:
        print(f"Error parsing the article: {ex}")
        return None

# Example usage:
url = 'https://www.nbcnews.com/news/world/sun-bears-zoo-china-denies-humans-costume-rcna97477'
article_text = scrape_article_text(url)

if article_text:
    print(article_text)
else:
    print("Failed to scrape the article.")
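
The thread does not confirm whether third-party packages such as requests and beautifulsoup4 are available inside Pabbly's Python code action. As a minimal sketch, assuming only the Python standard library can be used, the same "collect every <p> tag" idea could look like this. ParagraphCollector, paragraphs_from_html and fetch_html are hypothetical names, and the User-Agent header is an assumption for sites that reject the default client.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphCollector(HTMLParser):
    """Collects the visible text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def _flush(self):
        # Collapse runs of whitespace and keep only non-empty paragraphs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> is ended by the next <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def paragraphs_from_html(html):
    """Return the text of all <p> elements, one paragraph per line."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # keep text from a final <p> that was never closed
    return "\n".join(collector.paragraphs)


def fetch_html(url, timeout=10):
    """Download a page with the standard library only (performs a network call)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling paragraphs_from_html(fetch_html(url)) would then stand in for scrape_article_text(url). Like the original script, it collects every paragraph on the page, so navigation and footer text may be included.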

Supreme · Aug 2, 2023

Hey @Chris_W

Given the complexity of your use case, which needs custom code to parse the page content, I would strongly advise enlisting the services of an automation expert. They can design the required automation, including the JS/Python code for your workflow, and take care of all management aspects on your behalf.
