Internal Server Error with "Extract Pattern" Action when Scraping a Specific Website

Mike4711

Member
I am encountering persistent issues while trying to scrape news headlines and links from the website www.aldenhoven.de and write them to a Google Sheet using Pabbly Connect. The website does not offer an RSS feed, so I am using the "API (Pabbly) - Execute API Request" action to retrieve the HTML content, followed by the "Text Formatter by Pabbly" - "Extract Pattern" action to extract the relevant data.

Here's a summary of the steps taken and the problems encountered:

Retrieving HTML: The "API (Pabbly) - Execute API Request" action successfully retrieves the HTML source code of www.aldenhoven.de. I have verified this by reviewing the response body in the test run.

Attempting to Extract Data (Initial Regex): My initial attempt with the "Extract Pattern" action used the following regular expression to target the news headlines (which appear to be within <h3> tags with the class SP-SlideTeaser_headline):

<h3 class="SP-SlideTeaser_headline">(.*?)</h3>.*?<a\s+href="([^"]*)"[^>]*>(.*?)</a>

This resulted in an "internal server error".
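For comparison outside Pabbly, the same pattern can be checked against a small HTML fragment in Python. The fragment below is a made-up stand-in for the page's markup (the real structure of www.aldenhoven.de is not reproduced here), and Python's `re` module may behave differently from whatever engine "Extract Pattern" uses. What the sketch does illustrate: when the headline and the link sit on different lines, `.*?` only spans them if the dot is allowed to match newlines (`re.DOTALL` in Python).

```python
import re

# Hypothetical fragment; the real page markup may differ.
SAMPLE_HTML = """<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">Example headline</h3>
  <a href="/news/example" class="SP-SlideTeaser_link">Read more</a>
</article>"""

PATTERN = (
    r'<h3 class="SP-SlideTeaser_headline">(.*?)</h3>'
    r'.*?<a\s+href="([^"]*)"[^>]*>(.*?)</a>'
)

def find_teasers(html_text, flags=0):
    """Return (headline, href, link_text) tuples for every match."""
    return re.findall(PATTERN, html_text, flags)

# Without DOTALL the dot stops at the line break after </h3>.
without_dotall = find_teasers(SAMPLE_HTML)
# With DOTALL the dot may cross line breaks, so the link is reached.
with_dotall = find_teasers(SAMPLE_HTML, re.DOTALL)

print(without_dotall)
print(with_dotall)
```

If a pattern works in a local test like this but not in the workflow, the difference is more likely in the input text or the engine's flags than in the pattern itself.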

Simplified Regex Tests: To isolate the issue, I tried progressively simpler regular expressions:

<h3[^>]*>(.*?)</h3>: This also resulted in an "internal server error".
<h3[^>]*>: This ran without an error, but the "Result" field was empty, indicating no matches were found.
<h3: This also ran without an error, but the "Result" field remained empty.
<[^>]+> (a very general pattern to match any HTML tag): This also ran without an error, but the "Result" field was empty.
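A literal pattern such as `<h3` finds nothing only if the text it is applied to does not contain that character sequence. One way that can happen is if the text being searched is not the raw HTML, for example because it is empty, a different field was mapped, or the markup arrives entity-escaped as `&lt;h3`. Whether any of these applies to this workflow is an assumption and is not confirmed anywhere in the thread. The sketch below uses Python's `re` and a made-up fragment to show the effect.

```python
import html
import re

# Hypothetical fragment standing in for the fetched page.
RAW = '<h3 class="SP-SlideTeaser_headline">Example headline</h3>'
# The same fragment as it would look if it were entity-escaped.
ESCAPED = html.escape(RAW)

PATTERNS = [r"<h3[^>]*>(.*?)</h3>", r"<h3[^>]*>", r"<h3", r"<[^>]+>"]

def count_matches(text):
    """Map each simplified pattern to its number of matches in text."""
    return {p: len(re.findall(p, text)) for p in PATTERNS}

raw_counts = count_matches(RAW)
escaped_counts = count_matches(ESCAPED)

print(raw_counts)
print(escaped_counts)
```

Every pattern matches the raw fragment and none matches the escaped one, which mirrors the "no error, empty result" behaviour of the simpler patterns.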

Attempting to Extract the Entire News Block: I then tried to extract the entire <article> block containing each news item using the following regex:

<article class="SP-SlideTeaser[^>]*>(.*?)</article>

This attempt also resulted in an "internal server error".

The fact that even the most basic regex patterns (<h3) return no results, while more specific or complex patterns lead to an "internal server error," suggests that there might be an issue beyond the regular expression itself. It's possible that:

Pabbly Connect's "Extract Pattern" function is encountering difficulties processing the specific HTML structure of this website.
There might be limitations in the size or complexity of the HTML being processed.
The website might be using techniques (like dynamic content loading via JavaScript) that result in the HTML fetched by the "API (Pabbly)" action being different from what is rendered in a browser. However, the basic tags like <h3> should still be present in the initial HTML if this were the sole issue.
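The size and JavaScript hypotheses can be checked on the response body itself, before any pattern matching. The helper below is a hypothetical sketch, not a Pabbly feature: it reports the body's size and how often a few tags occur, using a made-up fragment in place of the real response.

```python
def summarize_html(body):
    """Report size and occurrence counts for a few tags of interest."""
    lowered = body.lower()
    return {
        "bytes": len(body.encode("utf-8")),
        "h3": lowered.count("<h3"),
        "article": lowered.count("<article"),
        "script": lowered.count("<script"),
    }

# Hypothetical response body; paste the real one from the test run instead.
sample_body = (
    '<html><body><script src="app.js"></script>'
    '<div id="root"></div></body></html>'
)
summary = summarize_html(sample_body)
print(summary)
```

A body with `<script>` tags but no `<h3>` would point toward content rendered client-side, while a very large byte count would make the size hypothesis worth raising with support.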

Could you please investigate why the "Extract Pattern" action is failing (either with no results or an internal server error) when attempting to process the HTML content from www.aldenhoven.de? Are there any known limitations or specific configurations I might be missing? Are there alternative methods within Pabbly Connect that you would recommend for extracting data from websites without an RSS feed in such cases?
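As one possible alternative to regex, an HTML parser can pull out the same data. The sketch below uses Python's standard `html.parser` and would have to run somewhere Python is available; it is not a description of a Pabbly action. The class name `SP-SlideTeaser_headline` comes from the page as described above, while the surrounding structure (each headline followed by its link) and the names `TeaserParser` and `extract_teasers` are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TeaserParser(HTMLParser):
    """Collect text of matching <h3> tags and the href of the next <a>."""

    def __init__(self, headline_class):
        super().__init__()
        self.headline_class = headline_class
        self.in_headline = False
        self.items = []  # each item: {"headline": str, "href": str or None}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3" and self.headline_class in classes:
            self.in_headline = True
            self.items.append({"headline": "", "href": None})
        elif tag == "a" and self.items and self.items[-1]["href"] is None:
            # First link seen after a headline is treated as its link.
            self.items[-1]["href"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.items[-1]["headline"] += data

def extract_teasers(html_text, headline_class="SP-SlideTeaser_headline"):
    """Return a list of (headline, href) pairs found in html_text."""
    parser = TeaserParser(headline_class)
    parser.feed(html_text)
    parser.close()
    return [(item["headline"].strip(), item["href"]) for item in parser.items]

# Hypothetical markup; the real page may be structured differently.
SAMPLE = """
<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">First headline</h3>
  <a href="/news/first">Read more</a>
</article>
<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">Second headline</h3>
  <a href="/news/second">Read more</a>
</article>
"""
teasers = extract_teasers(SAMPLE)
print(teasers)
```

A parser of this kind does not depend on line breaks or attribute order, which makes it less brittle than a single regex over the whole page.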

Thank you for your time and assistance with this matter.

Sincerely,
Michael
 

Preeti Paryani

Well-known member
Staff member
Hello @Mike4711,

Upon reviewing your workflow, we noticed that the pattern entered is not in regex format. Please ensure you enter a valid regex pattern. For more details, refer to the documentation linked in the help text. Additionally, we recommend testing your regex using the tools mentioned in the docs. If it works there, apply it to your workflow and let us know if you face any issues.


[Attached screenshots: 1742891231469.png, 1742891256895.png]
 
