I am encountering persistent issues while trying to scrape news headlines and links from www.aldenhoven.de and write them to a Google Sheet using Pabbly Connect. The website does not offer an RSS feed, so I am using the "API (Pabbly) - Execute API Request" action to retrieve the HTML content, followed by the "Text Formatter by Pabbly" - "Extract Pattern" action to extract the relevant data.
Here's a summary of the steps taken and the problems encountered:
Retrieving HTML: The "API (Pabbly) - Execute API Request" action successfully retrieves the HTML source code of www.aldenhoven.de. I have verified this by reviewing the response body in the test run.
Attempting to Extract Data (Initial Regex): My initial attempt with the "Extract Pattern" action used the following regular expression to target the news headlines (which appear to be within <h3> tags with the class SP-SlideTeaser_headline):
<h3 class="SP-SlideTeaser_headline">(.*?)</h3>.*?<a\s+href="([^"]*)"[^>]*>(.*?)</a>
This resulted in an "internal server error".
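For reference, here is a minimal local sketch of how the same pattern behaves in Python's re module. The HTML fragment is made up for illustration (the real markup on www.aldenhoven.de may differ), and Pabbly's regex engine may not behave the same way. It shows that the pattern only matches across line breaks when "." is allowed to match newlines.

```python
import re

# Hypothetical fragment; the real markup on www.aldenhoven.de may differ.
SAMPLE_HTML = """
<article class="SP-SlideTeaser">
  <h3 class="SP-SlideTeaser_headline">Sample headline</h3>
  <a href="/news/sample" class="SP-SlideTeaser_link">Read more</a>
</article>
"""

PATTERN = (
    r'<h3 class="SP-SlideTeaser_headline">(.*?)</h3>'
    r'.*?<a\s+href="([^"]*)"[^>]*>(.*?)</a>'
)


def extract_teasers(html, flags=re.DOTALL):
    """Return (headline, href, link_text) tuples for every match."""
    return [
        tuple(part.strip() for part in match.groups())
        for match in re.finditer(PATTERN, html, flags)
    ]


# With re.DOTALL, "." also matches the newline between </h3> and <a ...>.
print(extract_teasers(SAMPLE_HTML))
# Without it, the same pattern finds nothing in this multi-line fragment.
print(extract_teasers(SAMPLE_HTML, flags=0))
```

If the "Extract Pattern" action offers no equivalent flag, replacing "." with [\s\S] in the pattern is a common substitute; whether Pabbly accepts that is untested here.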
Simplified Regex Tests: To isolate the issue, I tried progressively simpler regular expressions:
<h3[^>]*>(.*?)</h3>: This also resulted in an "internal server error".
<h3[^>]*>: This ran without an error, but the "Result" field was empty, indicating no matches were found.
<h3: This also ran without an error, but the "Result" field remained empty.
<[^>]+> (a very general pattern to match any HTML tag): This also ran without an error, but the "Result" field was empty.
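As a sanity check, the four simplified patterns can be run locally. This sketch assumes Python's re module and a made-up one-line fragment, so it says nothing about Pabbly's internals. Every pattern matches any input that contains an <h3> element, and none matches an empty string, which is the result one would expect if the text reaching the "Extract Pattern" step were empty or not mapped.

```python
import re

# Hypothetical fragment standing in for the fetched page.
SAMPLE_HTML = '<div><h3 class="SP-SlideTeaser_headline">Sample headline</h3></div>'

# The four simplified patterns from the tests above, in the same order.
PATTERNS = [
    r"<h3[^>]*>(.*?)</h3>",
    r"<h3[^>]*>",
    r"<h3",
    r"<[^>]+>",
]


def count_matches(html):
    """Map each pattern to the number of matches it finds in html."""
    return {pattern: len(re.findall(pattern, html)) for pattern in PATTERNS}


for label, text in (("sample", SAMPLE_HTML), ("empty", "")):
    print(label, count_matches(text))
```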
Attempting to Extract the Entire News Block: I then tried to extract the entire <article> block containing each news item using the following regex:
<article class="SP-SlideTeaser[^>]*>(.*?)</article>
This attempt also resulted in an "internal server error".
Every pattern that contains a lazy capture group (.*?) produces an "internal server error", while the patterns without one run but return nothing, even the literal <h3. This suggests that the issue lies beyond the regular expression itself. It's possible that:
Pabbly Connect's "Extract Pattern" function is encountering difficulties processing the specific HTML structure of this website.
There might be limitations in the size or complexity of the HTML being processed.
The website might be using techniques (like dynamic content loading via JavaScript) that result in the HTML fetched by the "API (Pabbly)" action being different from what is rendered in a browser. However, even then the general pattern <[^>]+> should match at least one tag in the initial HTML, so this alone would not explain the empty results.
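The last two possibilities can be checked outside Pabbly with a small script. This is a sketch under assumptions: fetch_html and summarize_html are hypothetical helper names, the User-Agent header is an arbitrary choice, and the counts are plain substring counts rather than real HTML parsing.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Download a page and decode it using the charset the server reports."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def summarize_html(html):
    """Report the size of the HTML and how often a few tags occur in it."""
    lowered = html.lower()
    return {
        "size_bytes": len(html.encode("utf-8")),
        "h3_tags": lowered.count("<h3"),
        "article_tags": lowered.count("<article"),
        "script_tags": lowered.count("<script"),
    }
```

Calling summarize_html(fetch_html("https://www.aldenhoven.de")) would show how large the raw response is and whether <h3> and <article> tags appear before any JavaScript runs. A count of zero for both would point toward dynamically loaded content.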
Could you please investigate why the "Extract Pattern" action is failing (either with no results or an internal server error) when attempting to process the HTML content from www.aldenhoven.de? Are there any known limitations or specific configurations I might be missing? Are there alternative methods within Pabbly Connect that you would recommend for extracting data from websites without an RSS feed in such cases?
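One alternative I could imagine, if Pabbly Connect has a step that can run custom code (an assumption on my part), is to parse the HTML with a real parser instead of a regex. Below is a minimal sketch using Python's standard html.parser. TeaserParser is a hypothetical name, and the sketch assumes that each headline and its link sit inside the same <article> element, which I have not confirmed for this site.

```python
from html.parser import HTMLParser


class TeaserParser(HTMLParser):
    """Collect the headline and first link found inside each teaser <article>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None
        self._in_headline = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "article" and any(c.startswith("SP-SlideTeaser") for c in classes):
            self._current = {"headline": "", "href": None}
        elif self._current is not None:
            if tag == "h3" and "SP-SlideTeaser_headline" in classes:
                self._in_headline = True
            elif tag == "a" and self._current["href"] is None:
                self._current["href"] = attributes.get("href")

    def handle_data(self, data):
        if self._in_headline and self._current is not None:
            self._current["headline"] += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_headline = False
        elif tag == "article" and self._current is not None:
            # Collapse line breaks and repeated spaces inside the headline.
            self._current["headline"] = " ".join(self._current["headline"].split())
            self.items.append(self._current)
            self._current = None
            self._in_headline = False


def parse_teasers(html):
    """Return a list of {"headline": ..., "href": ...} dicts."""
    parser = TeaserParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Each returned dict could then be mapped to one row in the Google Sheet.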
Thank you for your time and assistance with this matter.
Sincerely,
Michael