Website Content Crawler

Pricing: Pay per usage

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Rating: 4.0 (41)


Total users: 65K
Monthly users: 8.5K
Runs succeeded: >99%
Issues response: 7.3 days
Last modified: 2 days ago


0.3.67 CheerioCrawler shows “Request timeout (0) ms exceeded” despite requestTimeoutSecs being set to 60

Open

uglyrobot opened this issue
10 days ago

I’ve set the configuration option requestTimeoutSecs: 60. However, during execution, I see the following error in the logs:

ERROR CheerioCrawler: Request failed and reached maximum retries. Error: impit error: Request timeout (0) ms exceeded.

This suggests the timeout is being treated as 0 ms, even though the configuration specifies 60 seconds. It seems to have started happening more and more often, affecting a significant percentage of URLs in any crawl job.

Also, because these requests never get added to the results (as an HTTP 504, for example, would be), this is hard to debug.
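For reference, a minimal sketch of an input that should reproduce this, assuming the Actor's documented startUrls, crawlerType, and requestTimeoutSecs input fields (the URL is a placeholder):

```python
# Hypothetical Website Content Crawler input (placeholder URL).
# requestTimeoutSecs is explicitly set to 60 seconds, yet the error
# message reports a timeout of 0 ms.
run_input = {
    "startUrls": [{"url": "https://example.com"}],
    "crawlerType": "cheerio",    # the crawler type showing the error
    "requestTimeoutSecs": 60,    # the setting that appears to be ignored
}
```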

jindrich.bar

Hello, and thank you for your interest in this Actor.

We just tested this and reproduced the issue. Fortunately, it is only a logging problem: the Actor respects the configured timeout value but logs a wrong number on error.

I filed an issue on GitHub; feel free to follow the progress there.

Thank you for bringing this up!


uglyrobot

9 days ago

Just to be clear, these requests are actually timing out at my 60-second limit, then? And is that a connection timeout or a request timeout?


uglyrobot

7 days ago

OK, I did some debugging. There seems to be a bug in version 0.3.67 of the Actor: when using a custom proxy with Cheerio, a large percentage of requests consistently fail with this timeout error. Running the exact same input on version 0.3.66 works fine.

jindrich.bar

Hello again, and thank you for the details regarding this error.

I looked into this a bit more, and it seems that the new HTTP client implementation is indeed not respecting the longer user-set timeouts. I made a PR to the underlying library that fixes this.

Until this is merged (and a new WCC version is released), feel free to pin your runs to the last working version (in your case, that would be 0.3.66).
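If it helps, one way to pin a run is the build parameter of the Apify API v2 "Run Actor" endpoint. A stdlib-only sketch that just constructs the run URL (token handling and request body omitted; Actor ID and version as in this thread):

```python
from urllib.parse import urlencode

# Sketch: pin a run to a specific Actor version via the `build`
# query parameter of the Apify API v2 "Run Actor" endpoint.
ACTOR_ID = "apify~website-content-crawler"
params = urlencode({"build": "0.3.66"})  # last version without the regression
run_url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?{params}"
```

POSTing to that URL (with your API token) starts a run on the pinned build; the same build value can also be set on a saved task in the Apify Console.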

I'll let you know here once this is resolved.

Thank you for your patience and for providing useful debugging info. Cheers!


uglyrobot

2 days ago

Thanks!