Internal Server Error with "Extract Pattern" Action when Scraping a Specific Website

Mike4711

Member
I am encountering persistent issues while trying to scrape news headlines and links from the website www.aldenhoven.de and write them to a Google Sheet using Pabbly Connect. The website does not offer an RSS feed, so I am using the "API (Pabbly) - Execute API Request" action to retrieve the HTML content, followed by the "Text Formatter by Pabbly" - "Extract Pattern" action to extract the relevant data.

Here's a summary of the steps taken and the problems encountered:

Retrieving HTML: The "API (Pabbly) - Execute API Request" action successfully retrieves the HTML source code of www.aldenhoven.de. I have verified this by reviewing the response body in the test run.

Attempting to Extract Data (Initial Regex): My initial attempt with the "Extract Pattern" action used the following regular expression to target the news headlines (which appear to be within <h3> tags with the class SP-SlideTeaser_headline):

<h3 class="SP-SlideTeaser_headline">(.*?)</h3>.*?<a\s+href="([^"]*)"[^>]*>(.*?)</a>

This resulted in an "internal server error".

Simplified Regex Tests: To isolate the issue, I tried progressively simpler regular expressions:

<h3[^>]*>(.*?)</h3>: This also resulted in an "internal server error".
<h3[^>]*>: This ran without an error, but the "Result" field was empty, indicating no matches were found.
<h3: This also ran without an error, but the "Result" field remained empty.
<[^>]+> (a very general pattern to match any HTML tag): This also ran without an error, but the "Result" field was empty.
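For comparison, the same patterns can be checked outside Pabbly Connect. The following is a minimal Python sketch, assuming a hypothetical HTML fragment (SAMPLE_HTML) modelled on the markup described above; the real page may be structured differently, and Pabbly's regex engine may not behave like Python's `re` module. The helper name count_matches is also just for illustration.

```python
import re

# Hypothetical fragment modelled on the markup described in the post.
# The real HTML of the site may differ.
SAMPLE_HTML = """
<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">
    Example headline one
  </h3>
  <a href="/news/example-1">Read more</a>
</article>
<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">
    Example headline two
  </h3>
  <a href="/news/example-2">Read more</a>
</article>
"""

# The patterns tried in the post, from most specific to most general.
PATTERNS = [
    r"<h3[^>]*>(.*?)</h3>",
    r"<h3[^>]*>",
    r"<h3",
    r"<[^>]+>",
    r'<article class="SP-SlideTeaser[^>]*>(.*?)</article>',
]


def count_matches(pattern, text, flags=0):
    """Return how many non-overlapping matches the pattern finds in text."""
    return len(re.findall(pattern, text, flags))


for pattern in PATTERNS:
    plain = count_matches(pattern, SAMPLE_HTML)
    dotall = count_matches(pattern, SAMPLE_HTML, re.DOTALL)
    print(f"{pattern!r}: {plain} without DOTALL, {dotall} with DOTALL")
```

On this sample, the two patterns containing (.*?) only match when `.` is allowed to cross line breaks (re.DOTALL in Python), because the headline text sits on its own line. Whether that is related to the "internal server error" is unknown; the sketch only shows that the simple patterns such as <h3 do match ordinary HTML, so an empty result for them points at the input rather than the pattern.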

Attempting to Extract the Entire News Block: I then tried to extract the entire <article> block containing each news item using the following regex:

<article class="SP-SlideTeaser[^>]*>(.*?)</article>

This attempt also resulted in an "internal server error".

The fact that even the most basic regex patterns (<h3) return no results, while more specific or complex patterns lead to an "internal server error," suggests that there might be an issue beyond the regular expression itself. It's possible that:

Pabbly Connect's "Extract Pattern" function is encountering difficulties processing the specific HTML structure of this website.
There might be limitations in the size or complexity of the HTML being processed.
The website might be using techniques (like dynamic content loading via JavaScript) that result in the HTML fetched by the "API (Pabbly)" action being different from what is rendered in a browser. However, the basic tags like <h3> should still be present in the initial HTML if this were the sole issue.
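One way to narrow these possibilities down is to inspect the text that is actually handed to "Extract Pattern". The sketch below is only an illustration in Python, not a Pabbly feature: describe_html is a hypothetical helper that summarises a response body, including a check for entity-escaped tags (&lt;h3), which is one assumed reason a plain <h3 pattern could find nothing.

```python
import re


def describe_html(html: str) -> dict:
    """Summarise a response body to show what a regex would really be run against."""
    return {
        # Very large bodies could hit size limits in downstream tools.
        "length": len(html),
        # Raw tags that a pattern such as <h3 should be able to match.
        "h3_tags": len(re.findall(r"<h3\b", html, re.IGNORECASE)),
        "article_tags": len(re.findall(r"<article\b", html, re.IGNORECASE)),
        # Many script tags can hint at content loaded by JavaScript.
        "script_tags": len(re.findall(r"<script\b", html, re.IGNORECASE)),
        # Entity-escaped tags would not match a pattern starting with "<".
        "escaped_h3_tags": html.count("&lt;h3"),
    }


# Usage with two tiny hypothetical bodies: one raw, one entity-escaped.
print(describe_html('<html><h3 class="x">A</h3><script></script></html>'))
print(describe_html("&lt;h3&gt;A&lt;/h3&gt;"))
```

If h3_tags is zero for the body copied from the test run, the headlines are either not in the initial HTML or are encoded differently from what the pattern expects.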

Could you please investigate why the "Extract Pattern" action is failing (either with no results or an internal server error) when attempting to process the HTML content from www.aldenhoven.de? Are there any known limitations or specific configurations I might be missing? Are there alternative methods within Pabbly Connect that you would recommend for extracting data from websites without an RSS feed in such cases?
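As a point of comparison for the question about alternatives, here is a minimal sketch of extracting the same data with an HTML parser instead of a regex, using only Python's standard library. It runs outside Pabbly Connect and rests on assumptions: the class name SP-SlideTeaser_headline comes from the description above, while the sample markup, the HeadlineParser class, the extract_headlines function, and the rule "the first link after a headline belongs to it" are hypothetical.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect headline text from matching <h3> tags and the next link after each."""

    def __init__(self):
        super().__init__()
        self.items = []            # list of {"headline": str, "link": str or None}
        self._in_headline = False  # True while inside a matching <h3>
        self._buffer = []          # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3" and "SP-SlideTeaser_headline" in classes:
            self._in_headline = True
            self._buffer = []
        elif tag == "a" and self.items and self.items[-1]["link"] is None:
            # Assumption: the first link after a headline belongs to it.
            self.items[-1]["link"] = attrs.get("href")

    def handle_data(self, data):
        if self._in_headline:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_headline:
            self._in_headline = False
            # Collapse line breaks and indentation into single spaces.
            headline = " ".join("".join(self._buffer).split())
            self.items.append({"headline": headline, "link": None})


def extract_headlines(html: str) -> list:
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample; the real page markup may differ.
SAMPLE = """
<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">
    Example headline one
  </h3>
  <a href="/news/example-1">Read more</a>
</article>
"""

print(extract_headlines(SAMPLE))
```

A parser does not depend on whether `.` matches line breaks and tolerates attribute order and whitespace changes, which is the main reason to consider it over a single large regex for HTML.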

Thank you for your time and assistance with this matter.

Sincerely,
Michael
 

Preeti Paryani

Well-known member
Staff member
Hello @Mike4711,

Upon reviewing your workflow, we noticed that the pattern entered is not in regex format. Please ensure you enter a valid regex pattern. For more details, refer to the documentation linked in the help text. Additionally, we recommend testing your regex using the tools mentioned in the docs. If it works there, apply it to your workflow and let us know if you face any issues.


[Attachments: 1742891231469.png, 1742891256895.png]