Cheerio Scraper
Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.
This scraper works mostly fine, but for some sites (for example https://uggrenew.com/) it fails almost immediately with the error below. Is there a way to fix this in the scraper? Or is this due to some setting on the site (I have access to the site, so we might be able to fix that)? Help appreciated.
WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session...
2023-09-16T03:11:30.669Z Proxy responded with 590 UPSTREAM502: 0 bytes
2023-09-16T03:11:30.670Z
2023-09-16T03:11:30.671Z {"id":"btKVLbPR2NzCFj5","url":"https://uggrenew.com/","retryCount":1}
Hello, looking into the logs, this looks like some kind of a temporary server error or bad proxy luck. I just managed to run the actor with identical input without any problems.
If the server is dead, the crawler run will always fail - but in case this is caused by bad proxies, there are a few things you can try:
- set the proxy setting to Automatic proxy
- set the proxy rotation option to Use recommended settings
- unset the proxy country (the crawler will have a larger proxy pool to pick from) - I see that this is the only thing you haven't tried yet :)
Unfortunately, without reproducing this issue, I cannot really provide more help right now.
Can you please go through the above steps and confirm whether they helped? Thank you!
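For anyone applying the three steps above through the Actor's JSON input rather than the Console form, here is a minimal sketch. The field names (proxyConfiguration, useApifyProxy, apifyProxyCountry, proxyRotation) and the value 'RECOMMENDED' are assumptions about the Actor's input schema and are not confirmed in this thread; withAutomaticProxy is a hypothetical helper, so please check the names against the Input tab before relying on them.

```javascript
// Hypothetical helper: returns a copy of an Actor input with the proxy
// settings relaxed as suggested above. All field names are assumptions.
function withAutomaticProxy(input) {
  // Drop the old proxy settings; keep everything else from the input.
  const { proxyConfiguration, ...rest } = input;
  return {
    ...rest,
    // "Automatic proxy": enable Apify Proxy without pinning groups or a country.
    proxyConfiguration: { useApifyProxy: true },
    // "Use recommended settings" for proxy rotation (assumed enum value).
    proxyRotation: 'RECOMMENDED',
  };
}

// Example input with a pinned proxy country (hypothetical earlier setting).
const original = {
  startUrls: [{ url: 'https://uggrenew.com/' }],
  proxyConfiguration: { useApifyProxy: true, apifyProxyCountry: 'US' },
};

const relaxed = withAutomaticProxy(original);
console.log(JSON.stringify(relaxed, null, 2));
```

The helper copies the input instead of mutating it, so the original settings remain available for comparison between runs.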
Thanks for getting back to me!
I've followed your instructions, including unsetting the proxy country. Tried a couple of times (see for example runs QTDp6S4dk02iYvJos and lCJRIlByoOqf3Q0qb), but I keep getting this specific error. And it's just for https://uggrenew.com/ (this is a client site we need to scrape - all other client sites so far have worked fine).
So you're saying that running with the same inputs (e.g. for https://uggrenew.com/) works fine on your side? Let me know if you have a dataset_id that you could share. Could it be that there is some caching going on in my account which causes it to always fail after it failed once?
Hello, sorry for taking longer - I have looked into your runs and tried running them under my account with the exact same input once again - here are the results (https://console.apify.com/view/runs/gFCbuRzozEINfZQce).
Looking into your account, I see that you have only two Proxy groups enabled (BUYPROXIES94952 and StaticUS3) - this is based on your plan. The "Automatic" proxy option can only choose from those two groups. I tried running the Cheerio Scraper under my account with those proxies and finally got the same error as you did 🎉
I can also see that you have the RESIDENTIAL proxies available. These are larger proxy groups with IP addresses from the consumer ranges, so they are very hard for webmasters to block. If you route the scraper traffic through these groups (by choosing Proxy and HTTP Configuration > Selected Proxies > RESIDENTIAL), you should finally be able to crawl the uggrenew.com page.
Keep in mind that the residential proxies come at a higher price than the datacenter ones, though. I would advise you to use them only when needed (just like in this case). You can learn more about the proxies and different pricing at https://apify.com/proxy.
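If several client sites are scraped from one script, the "only when needed" advice could be sketched roughly as follows: pick the RESIDENTIAL group only for hosts known to reject datacenter proxies, and fall back to the automatic selection otherwise. The field names useApifyProxy and apifyProxyGroups are assumptions about the Actor's proxy input, and RESIDENTIAL_HOSTS and proxyConfigurationFor are hypothetical names used only for illustration.

```javascript
// Hosts that are known to fail with datacenter proxies (hypothetical list).
const RESIDENTIAL_HOSTS = new Set(['uggrenew.com']);

// Hypothetical helper: returns the proxy part of the Actor input for a URL.
// Field names are assumptions about the Actor's input schema.
function proxyConfigurationFor(url, residentialHosts = RESIDENTIAL_HOSTS) {
  // Normalize the host so that "www.uggrenew.com" matches "uggrenew.com".
  const host = new URL(url).hostname.replace(/^www\./, '');
  if (residentialHosts.has(host)) {
    // Pricier residential proxies, used only for the listed hosts.
    return { useApifyProxy: true, apifyProxyGroups: ['RESIDENTIAL'] };
  }
  // Automatic proxy selection for every other site.
  return { useApifyProxy: true };
}

console.log(proxyConfigurationFor('https://uggrenew.com/'));
console.log(proxyConfigurationFor('https://example.com/'));
```

Keeping the host list explicit makes it easy to see which sites incur the higher residential proxy cost.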
Once again, sorry for the delay, thank you for your patience and let us know whether this has solved your issue. Thanks!
Closing due to inactivity.
Actor Metrics
455 monthly users
-
94 stars
>99% runs succeeded
31 days response time
Created in Apr 2019
Modified 3 months ago