Website Content Crawler

apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.

Hi, my run didn't work

Closed

ballerine opened this issue
10 days ago

Even though the site is live and we use residential proxies, the scraper failed.

Oscardz

Hello, I see the issue. In the HTML processing, you should change the "HTML transformer" to the default, which is "Readable text." I will close the ticket. If you need anything else, let us know.

ballerine

10 days ago

I'm looking for this configuration, but I wonder what is different about this site. I've run thousands of scraping jobs with the Actor using the same configuration.

ballerine

10 days ago

OK, the configuration we chose is correct. We use "HTML transformer": null, and the documentation says:

None - Only removes the HTML elements specified via 'Remove HTML elements' option.

That is what we want: we specify the elements to omit ourselves. Again, this has worked for the thousands of sites we have already run with the Actor.

Can you please check why this specific configuration is failing for this site?

Oscardz

After further investigation, we discovered that the issue was not the HTML transformer but a timeout. Timeouts can have multiple causes, such as slow content rendering, server overload, or slower residential proxies. If you want to keep the same input setup, our recommendation is to increase 'requestTimeoutSecs'.
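
For readers hitting the same problem, here is a minimal sketch of raising the timeout while leaving the rest of the input untouched. Only `requestTimeoutSecs` comes from this thread. The other field names (`startUrls`, `htmlTransformer`, `removeElementsCssSelector`, `proxyConfiguration`), the example URL, the selector list, and both timeout values are assumptions for illustration, and `with_longer_timeout` is a hypothetical helper, not part of the Actor. Check the names against the Actor's input schema before relying on them.

```python
from copy import deepcopy


def with_longer_timeout(run_input: dict, timeout_secs: int) -> dict:
    """Return a copy of the Actor input with a larger requestTimeoutSecs."""
    if timeout_secs <= 0:
        raise ValueError("timeout_secs must be positive")
    updated = deepcopy(run_input)
    # Only ever raise the timeout; keep the existing value if it is already higher.
    current = updated.get("requestTimeoutSecs", 0)
    updated["requestTimeoutSecs"] = max(current, timeout_secs)
    return updated


# Hypothetical input mirroring the setup described in this thread:
# no HTML transformer, custom element removal, residential proxies.
base_input = {
    "startUrls": [{"url": "https://example.com"}],        # placeholder URL
    "htmlTransformer": "none",                            # assumed field name and value
    "removeElementsCssSelector": "nav, footer, script",   # assumed field name
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
    "requestTimeoutSecs": 60,                             # illustrative starting value
}

# 180 is an arbitrary example value, not a recommendation from the maintainers.
run_input = with_longer_timeout(base_input, 180)
print(run_input["requestTimeoutSecs"])
```

The resulting dictionary can then be passed as the run input, for example through the Apify Python client with `ApifyClient(token).actor("apify/website-content-crawler").call(run_input=run_input)`, assuming the `apify-client` package is installed and a valid token is available.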

ballerine

7 days ago

Worked. Thanks, closing the issue.

Developer
Maintained by Apify
Actor metrics
  • 3k monthly users
  • 465 stars
  • 99.9% runs succeeded
  • 3.1 days response time
  • Created in Mar 2023
  • Modified 10 days ago