Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.

Crawler does not identify relative links

Open

MavenAGI opened this issue
a month ago

When trying to crawl the Oracle NetSuite documentation:

https://docs.oracle.com/en/cloud/saas/netsuite/index.html

Only the first page gets crawled. The page renders correctly and contains links, but they are not followed, so the crawl returns only the first page.

jindrich.bar

Hello, and thank you for your interest in this Actor!

This is likely caused by Website Content Crawler's default behavior, which only follows links leading to "descendants" of the input start URLs. You can override this behavior by using the Crawler settings > Include URLs (globs) option, which allows you to specify glob patterns that the links have to match in order to be followed.
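To illustrate how a glob pattern can decide which links are followed, here is a minimal Python sketch. It is not the Actor's actual matching code: `glob_to_regex` and `should_follow` are hypothetical helpers, and the translation rules (`**/` for any number of directories, `*` for anything within a single path segment) are an assumed approximation of common glob semantics. The `some-section/page.html` path is made up for the example.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simplified URL glob into a regex (assumed semantics)."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**/", i):
            parts.append(r"(?:.*/)?")  # zero or more directory levels
            i += 3
        elif glob.startswith("**", i):
            parts.append(r".*")  # anything, including slashes
            i += 2
        elif glob[i] == "*":
            parts.append(r"[^/]*")  # anything within one path segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))  # literal character
            i += 1
    return re.compile("".join(parts))


def should_follow(link: str, globs: list[str]) -> bool:
    """Return True if the link fully matches at least one glob."""
    return any(glob_to_regex(g).fullmatch(link) for g in globs)


if __name__ == "__main__":
    globs = ["https://docs.oracle.com/en/cloud/saas/netsuite/**/*"]
    for link in (
        "https://docs.oracle.com/en/cloud/saas/netsuite/index.html",
        # Hypothetical nested page, invented for illustration
        "https://docs.oracle.com/en/cloud/saas/netsuite/some-section/page.html",
        "https://docs.oracle.com/en/cloud/index.html",
    ):
        print(link, should_follow(link, globs))
```

Under these assumed rules, pages at any depth below `/en/cloud/saas/netsuite/` match the pattern, while pages elsewhere on the same host do not.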

I set up another run of the Actor with the https://docs.oracle.com/en/cloud/saas/netsuite/**/* URL glob and the Cheerio crawler type. The default Firefox browser was taking too much time, and the website you are trying to scrape seems to load just fine without client-side JavaScript execution. Please check out my run with these settings and verify that no content is missing. If you are satisfied with the results, feel free to use the input of this run to start a full scrape under your account.
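For reference, these settings might look roughly like this when expressed as Actor input. This is a sketch, not a copy of the input from the run mentioned in this reply: the field names `startUrls`, `includeUrlGlobs`, and `crawlerType`, and the value `"cheerio"`, are assumptions about the Actor's input schema, so check them against the Actor's input documentation before use. `build_run_input` is a hypothetical helper.

```python
import json

START_URL = "https://docs.oracle.com/en/cloud/saas/netsuite/index.html"
INCLUDE_GLOB = "https://docs.oracle.com/en/cloud/saas/netsuite/**/*"


def build_run_input(start_url: str, include_glob: str) -> dict:
    """Assemble a run input dict; field names are assumed, not confirmed."""
    return {
        # Page where the crawl begins
        "startUrls": [{"url": start_url}],
        # Only links matching one of these globs would be followed
        "includeUrlGlobs": [{"glob": include_glob}],
        # Plain HTTP crawler, no client-side JavaScript execution
        "crawlerType": "cheerio",
    }


if __name__ == "__main__":
    print(json.dumps(build_run_input(START_URL, INCLUDE_GLOB), indent=2))
```

The printed JSON could then be adapted to whatever input format the Actor actually expects.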

Did this solve your problem? If so, feel free to close this issue. If not, you can always ask additional questions in the comments here. Cheers!
