Internal Server Error with "Extract Pattern" Action when Scraping a Specific Website

Mike4711 · Monday at 9:31 PM

I am encountering persistent issues while trying to scrape news headlines and links from the website www.aldenhoven.de and write them to a Google Sheet using Pabbly Connect. The website does not offer an RSS feed, so I am using the "API (Pabbly) - Execute API Request" action to retrieve the HTML content, followed by the "Text Formatter by Pabbly" - "Extract Pattern" action to extract the relevant data.

Here's a summary of the steps taken and the problems encountered:

Retrieving HTML: The "API (Pabbly) - Execute API Request" action successfully retrieves the HTML source code of www.aldenhoven.de. I have verified this by reviewing the response body in the test run.

Attempting to Extract Data (Initial Regex): My initial attempt with the "Extract Pattern" action used the following regular expression to target the news headlines (which appear to be within <h3> tags with the class SP-SlideTeaser_headline):

<h3 class="SP-SlideTeaser_headline">(.*?)</h3>.*?<a\s+href="([^"]*)"[^>]*>(.*?)</a>

This resulted in an "internal server error".

Simplified Regex Tests: To isolate the issue, I tried progressively simpler regular expressions:

<h3[^>]*>(.*?)</h3>: This also resulted in an "internal server error".
<h3[^>]*>: This ran without an error, but the "Result" field was empty, indicating no matches were found.
<h3: This also ran without an error, but the "Result" field remained empty.
<[^>]+> (a very general pattern to match any HTML tag): This also ran without an error, but the "Result" field was empty.
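
For anyone who wants to reproduce these tests outside Pabbly Connect, here is a minimal sketch in Python. The sample HTML is hypothetical (it is not the real markup of www.aldenhoven.de), and Python's re module only approximates whatever engine "Extract Pattern" uses. It shows one thing that is easy to miss: "." does not match line breaks by default, so (.*?) finds nothing when the headline text sits on its own line, while the simpler patterns still match.

```python
import re

# Hypothetical sample, NOT the real markup of www.aldenhoven.de.
SAMPLE_HTML = """<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">
    Example headline
  </h3>
  <a href="/news/example">Read more</a>
</article>"""

# The four simplified patterns from the list above.
PATTERNS = [
    r"<h3[^>]*>(.*?)</h3>",
    r"<h3[^>]*>",
    r"<h3",
    r"<[^>]+>",
]


def count_matches(pattern, text, flags=0):
    """Return how many non-overlapping matches the pattern has in text."""
    return len(re.findall(pattern, text, flags))


for pattern in PATTERNS:
    default_count = count_matches(pattern, SAMPLE_HTML)
    dotall_count = count_matches(pattern, SAMPLE_HTML, re.DOTALL)
    print(f"{pattern!r}: default={default_count}, dotall={dotall_count}")
```

In an ordinary regex engine a literal pattern such as <h3 matches whenever the input contains it. If it returns nothing, it is worth checking that the response body is really what is mapped into the "Extract Pattern" step.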

Attempting to Extract the Entire News Block: I then tried to extract the entire <article> block containing each news item using the following regex:

<article class="SP-SlideTeaser[^>]*>(.*?)</article>

This attempt also resulted in an "internal server error".
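
The block-then-fields idea can be sketched locally as follows. This is an illustration under assumptions, not how Pabbly Connect works internally: the sample HTML and the function name extract_news are made up, and only the class names are taken from the patterns above. It uses [\s\S]*? instead of .*? so that a match can span line breaks without needing a special flag.

```python
import re

# Hypothetical sample, NOT the real markup of www.aldenhoven.de.
SAMPLE_HTML = """
<article class="SP-SlideTeaser SP-Teaser">
  <h3 class="SP-SlideTeaser_headline">
    First headline
  </h3>
  <a href="/news/first" class="SP-Link">more</a>
</article>
<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">Second headline</h3>
  <a  href="/news/second">more</a>
</article>
"""

# [\s\S] matches any character including newlines, unlike "." by default.
ARTICLE_RE = re.compile(r'<article class="SP-SlideTeaser[^>]*>([\s\S]*?)</article>')
HEADLINE_RE = re.compile(r'<h3 class="SP-SlideTeaser_headline">([\s\S]*?)</h3>')
LINK_RE = re.compile(r'<a\s+href="([^"]*)"')


def extract_news(html):
    """Return (headline, link) pairs, one per article block that has both."""
    items = []
    for block in ARTICLE_RE.findall(html):
        headline = HEADLINE_RE.search(block)
        link = LINK_RE.search(block)
        if headline and link:
            # Collapse runs of whitespace left over from the HTML indentation.
            title = " ".join(headline.group(1).split())
            items.append((title, link.group(1)))
    return items


for title, url in extract_news(SAMPLE_HTML):
    print(title, url)
```

Searching inside each article block keeps every headline paired with its own link. The sketch does not handle tags nested inside the headline; those would come through as raw markup.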

The fact that even the most basic regex patterns (<h3) return no results, while more specific or complex patterns lead to an "internal server error," suggests that there might be an issue beyond the regular expression itself. It's possible that:

Pabbly Connect's "Extract Pattern" function is encountering difficulties processing the specific HTML structure of this website.
There might be limitations in the size or complexity of the HTML being processed.
The website might be using techniques (like dynamic content loading via JavaScript) that result in the HTML fetched by the "API (Pabbly)" action being different from what is rendered in a browser. However, even a page that loads its news via JavaScript still contains some tags in its initial HTML, so the general pattern <[^>]+> should have matched if this were the sole issue.
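
These three possibilities can be narrowed down by inspecting the fetched body before any regex is applied. Below is a minimal sketch, assuming the response body has been copied from the test run into a Python string; the function name describe_html and the sample body are hypothetical. It reports the size of the body, how many script tags it contains, and how often each expected marker occurs.

```python
def describe_html(body, markers):
    """Summarise a fetched HTML body: size, script tags, and marker counts."""
    lowered = body.lower()
    return {
        "length_chars": len(body),
        "length_bytes": len(body.encode("utf-8")),
        # Many script tags and few content tags hint at client-side rendering.
        "script_tags": lowered.count("<script"),
        # Case-insensitive count of each marker string.
        "markers": {m: lowered.count(m.lower()) for m in markers},
    }


# Hypothetical body of a page that renders its content with JavaScript.
sample_body = '<html><script src="app.js"></script><div id="root"></div></html>'

report = describe_html(sample_body, ["<h3", "SP-SlideTeaser_headline"])
print(report)
```

A marker count of zero would point to the fetched HTML differing from the rendered page. A very large byte count would support the size hypothesis.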

Could you please investigate why the "Extract Pattern" action is failing (either with no results or an internal server error) when attempting to process the HTML content from www.aldenhoven.de? Are there any known limitations or specific configurations I might be missing? Are there alternative methods within Pabbly Connect that you would recommend for extracting data from websites without an RSS feed in such cases?

Thank you for your time and assistance with this matter.

Sincerely,
Michael

Preeti Paryani · Tuesday at 11:58 AM

Hello @Mike4711,

Please provide us with the workflow URL where you're encountering this issue so we can better assist you.

Mike4711 · Tuesday at 12:12 PM

Hi,
thank you for your prompt response. Here is the workflow URL where I am encountering the issue:

Pabbly

connect.pabbly.com

Preeti Paryani · Tuesday at 1:57 PM

Hello @Mike4711,

Upon reviewing your workflow, we noticed that the pattern entered is not in regex format. Please ensure you enter a valid regex pattern. For more details, refer to the documentation linked in the help text. Additionally, we recommend testing your regex using the tools mentioned in the docs. If it works there, apply it to your workflow and let us know if you face any issues.

Regular expressions - JavaScript | MDN

Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of...

developer.mozilla.org
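
As a quick local check before pasting a pattern into the workflow, the sketch below tests whether each pattern from this thread compiles at all. It assumes Python's re module as a stand-in: it differs from JavaScript regular expressions in some details, so a pattern that compiles here is not guaranteed to be accepted by "Extract Pattern", and the thread does not say which exact format that field expects. The function name is_valid_regex is made up for illustration.

```python
import re

# Patterns quoted earlier in this thread.
THREAD_PATTERNS = [
    r'<h3 class="SP-SlideTeaser_headline">(.*?)</h3>.*?<a\s+href="([^"]*)"[^>]*>(.*?)</a>',
    r'<h3[^>]*>(.*?)</h3>',
    r'<article class="SP-SlideTeaser[^>]*>(.*?)</article>',
]


def is_valid_regex(pattern):
    """Return True if the pattern compiles with Python's re module."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


for pattern in THREAD_PATTERNS:
    print(is_valid_regex(pattern), pattern)

# A deliberately broken pattern for comparison (unbalanced parenthesis).
print(is_valid_regex("(unclosed"), "(unclosed")
```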
