Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.0 (41)

1638

Total users: 65K

Monthly users: 8.5K

Runs succeeded: >99%

Issues response: 7.3 days

Last modified: 2 days ago

Crawler is timing out for root URL

Open

adarshkm opened this issue
5 days ago

When crawling, the behaviour is inconsistent. Sometimes the crawler is able to scrape the text, and sometimes it just fails with 403 errors, even after multiple retries.

https://donate.unrwa.org/

jindrich.bar

Hello Adarsh, and thank you for reporting this.

The run you’re referring to timed out due to a backend issue on our side; sorry about that. The backend was overloaded during a rolling release and could not find a suitable time slot to schedule your run. This is a very rare issue.

As for the inconsistent behavior and 403 errors you're seeing in other runs - these issues can depend on several factors, such as:

  • The crawler mode being used (e.g., browser-based vs. Cheerio)
  • Whether you're using proxies, and what kind
  • Specific settings in your Actor (such as timeouts, clicked selectors, etc.)
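
To make these factors concrete, here is a minimal sketch of how two runs against the same URL might differ only in crawler mode and proxy type. The helper `build_run_input` is hypothetical, and the field names (`startUrls`, `crawlerType`, `proxyConfiguration`, `maxRequestRetries`, `requestTimeoutSecs`) and values are assumptions about the Actor's input; please check them against the Actor's input schema before relying on them.

```python
import json


def build_run_input(start_url, crawler_type="cheerio", proxy_groups=None,
                    max_retries=3, request_timeout_secs=60):
    """Return an illustrative Actor input dict for a single start URL.

    Hypothetical helper: the keys below are assumed field names,
    not a confirmed copy of the Actor's input schema.
    """
    proxy = {"useApifyProxy": True}
    if proxy_groups:
        # Only name proxy groups when the caller asks for a specific kind.
        proxy["apifyProxyGroups"] = list(proxy_groups)
    return {
        "startUrls": [{"url": start_url}],
        "crawlerType": crawler_type,
        "proxyConfiguration": proxy,
        "maxRequestRetries": max_retries,
        "requestTimeoutSecs": request_timeout_secs,
    }


# Variant 1: plain HTTP crawling (Cheerio) with the default proxy setup.
http_only = build_run_input("https://donate.unrwa.org/")

# Variant 2: browser-based crawling with an assumed residential proxy group.
browser = build_run_input(
    "https://donate.unrwa.org/",
    crawler_type="playwright:firefox",
    proxy_groups=["RESIDENTIAL"],
)

print(json.dumps(browser, indent=2))
```

Comparing the results of two such inputs, with everything else held equal, is one way to see whether the 403s follow the crawler mode or the proxy type.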

To help you further, we'll need a bit more context. Could you please share:

  • The Actor input or settings you're using
  • Whether you're using proxies, and which type
  • A couple of run links where the 403s or unexpected behavior happened

Once we have that, we can take a closer look and guide you more effectively. Looking forward to your reply!