Smart Article Extractor
📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.
Could you please help me understand how to pass the timeout parameter using the API? I never see it reflected, even when using the web interface. Additionally, I believe the Actor occasionally fails to check the articles that are already stored.
I have used this Actor for many months now, and in recent weeks some of my runs have been running indefinitely. I don't know what happened. Until now I didn't have to monitor the runs every day because the results were as expected, but lately I've had some terrible experiences with the Actor running endlessly and costing me loads of money.
Edit: these two runs have identical input:
- https://console.apify.com/actors/runs/TrPjXGgDm5PHNiM48#output - 3,099 results. (Not desired)
- https://console.apify.com/actors/runs/SJcFnVa96s3OFZy2w#output - 66 results. (Desired).
Hey,
When starting a new run through the API, you can define the timeout with a query parameter. For example, to start a run with a timeout of 60 seconds: POST https://api.apify.com/v2/acts/<ACTOR>/runs?timeout=60. Here are the docs.
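For reference, here is a minimal Python sketch of that call using only the standard library. The helper names (build_run_url, start_run) and the example Actor ID are made up for illustration, and sending the token as a Bearer header is an assumption about how you authenticate, so please check it against the API docs before relying on it.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

API_BASE = "https://api.apify.com/v2"


def build_run_url(actor_id: str, timeout_secs: int) -> str:
    """Build the run-start URL with the timeout (in seconds) as a query parameter."""
    query = urlencode({"timeout": timeout_secs})
    # Keep "~" unescaped so IDs written as "username~actor-name" stay readable.
    return f"{API_BASE}/acts/{quote(actor_id, safe='~')}/runs?{query}"


def start_run(actor_id: str, token: str, run_input: dict, timeout_secs: int) -> dict:
    """POST the run input to the URL above and return the parsed JSON response.

    Assumption: the API token is accepted as a Bearer token in the
    Authorization header.
    """
    request = urllib.request.Request(
        build_run_url(actor_id, timeout_secs),
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical Actor ID; printing the URL makes no network call.
    print(build_run_url("my-username~my-actor", 60))
```

The point is that the timeout lives in the URL query string rather than in the JSON body, which holds the Actor input.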
The issue with the run is that the website changed the URLs of the articles: they added an ?amp query parameter. As a result, the Actor scraped the articles again, because their URLs were different from the ones that were already stored. We'll need to figure out how to avoid situations like this. We cannot ignore query parameters completely because they may identify the article (e.g. ?articleId=edasdas), so this will require more thinking. Thank you for the report, we'll let you know when we have any updates on this. Let me know if you have any questions.
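To illustrate the trade-off, here is a rough Python sketch of one possible approach, not what the Actor does today: build a comparison key that drops only parameters known to be noise and keeps everything else. The IGNORED_PARAMS list and the dedup_key name are assumptions made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that never change which article a URL points to.
IGNORED_PARAMS = frozenset({"amp"})


def dedup_key(url: str, ignored=IGNORED_PARAMS) -> str:
    """Return a comparison key for a URL with ignorable query parameters removed.

    All other parameters are kept, since they may identify the article.
    The fragment is dropped as well.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        # keep_blank_values=True so value-less parameters are seen at all.
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    # Same article with and without the added parameter: keys match.
    print(dedup_key("https://example.com/news/story?amp"))
    print(dedup_key("https://example.com/news/story"))
    # A parameter that identifies the article is preserved.
    print(dedup_key("https://example.com/article?articleId=edasdas&amp=1"))
```

With this kind of key, a stored URL and its ?amp variant compare equal, while ?articleId=... still distinguishes articles. The hard part is choosing the ignore list, which likely differs from site to site.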
Actor Metrics
277 monthly users
82 stars
>99% runs succeeded
2.3 days response time
Created in Nov 2019
Modified a month ago