Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Not able to download any pages or files

Closed

ollieiq opened this issue
3 months ago

The logs show that no links are found on the page(s) I am trying to scrape. Please advise me on different settings I can try to get user manuals from this site. I have tried toggling the sitemap and canonical URL options off and on, setting a depth of up to 6 pages, and using both residential and datacenter proxies. I have also tried adding additional URLs on other runs, but no manuals are found.

jindrich.bar

Hello, and thank you for your interest in this Actor!

You've just run into one of the limitations of the Website Content Crawler. While it can handle dynamically loaded data, it cannot interact with JS-based pagination, because there is no standardized way to implement such pagination. In this respect, WCC is similar to the Google crawler, which also claims not to process it. Implementing pagination this way can therefore hurt a website's SEO performance, yet some websites still do it, as you found out.
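To illustrate why the logs report no links, here is a minimal sketch. It assumes a crawler that discovers new pages by reading the `href` attributes of `<a>` elements, which is a simplification and not WCC's actual code. The two HTML snippets and the `loadPage` handler are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, as a simple link-following crawler might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip anchors without a real target (e.g. href="#" paired with a JS handler).
            if href and not href.startswith(("#", "javascript:")):
                self.links.append(href)


def find_links(html: str) -> list[str]:
    """Return every followable link found in the given HTML fragment."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical markup: classic pagination exposes a URL for every page.
HREF_PAGINATION = '<a href="/manuals?page=2">2</a> <a href="/manuals?page=3">3</a>'

# Hypothetical markup: JS-based pagination only exposes click handlers.
JS_PAGINATION = '<button onclick="loadPage(2)">2</button> <a href="#" onclick="loadPage(3)">3</a>'

print(find_links(HREF_PAGINATION))  # ['/manuals?page=2', '/manuals?page=3']
print(find_links(JS_PAGINATION))    # []
```

With the second kind of markup there is nothing to enqueue, no matter which depth, sitemap, or proxy settings are used.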

If you want to automate this (for example, because you expect the page to add new links over time), you can check out our Web Scraper. This Actor lets you execute custom code, so you can click on page elements and navigate the website through those interactions.
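The click-through approach could look roughly like the sketch below. In Web Scraper the custom code is a JavaScript page function, so this Python only outlines the control flow. `PageLike`, `visible_links`, `click_next`, and `FakePage` are hypothetical names invented for this example, not part of any Apify or browser API, and the file paths are placeholders.

```python
from typing import Protocol


class PageLike(Protocol):
    """Hypothetical minimal view of a browser page (not a real Apify API)."""

    def visible_links(self) -> list[str]: ...

    def click_next(self) -> bool: ...  # returns False when there is no further page


def collect_links(page: PageLike, max_clicks: int = 50) -> list[str]:
    """Gather links from every paginated view by clicking through them."""
    seen: list[str] = []
    clicks = 0
    while True:
        # Harvest whatever the current view exposes, keeping first-seen order.
        for link in page.visible_links():
            if link not in seen:
                seen.append(link)
        # Stop at the click budget or when the "next" control is exhausted.
        if clicks >= max_clicks or not page.click_next():
            break
        clicks += 1
    return seen


class FakePage:
    """In-memory stand-in used to exercise the loop without a browser."""

    def __init__(self, views: list[list[str]]):
        self._views = views
        self._index = 0

    def visible_links(self) -> list[str]:
        return self._views[self._index]

    def click_next(self) -> bool:
        if self._index + 1 >= len(self._views):
            return False
        self._index += 1
        return True


page = FakePage([["/manual-a.pdf", "/manual-b.pdf"], ["/manual-c.pdf"]])
print(collect_links(page))  # ['/manual-a.pdf', '/manual-b.pdf', '/manual-c.pdf']
```

The `max_clicks` cap is a design choice to keep the loop from running forever on pages whose "next" control never disappears.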

While I understand that I didn't exactly solve your problem, I'll close this issue. That's our usual process for JS-navigation-based issues, as there is very little we can do (without turning WCC into a very website-specific tool, which we don't want to do).

Either way, feel free to suggest any ideas or ask any additional questions, be it here, or in new issues. Thanks!

Developer
Maintained by Apify
Actor metrics
  • 3.8k monthly users
  • 616 stars
  • 99.9% runs succeeded
  • 3.4 days response time
  • Created in Mar 2023
  • Modified 3 days ago