Website Content Crawler

apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.


Unable to get any data from certain websites

Closed

ollieiq opened this issue
2 months ago

I have tried several runs crawling https://www.netgear.com/support/ with a max depth of 0, 4, and 10, and the crawler is unable to pull any data at all. There is 0 output. I have the same problem with runs on https://www.asus.com/us/support/ as well. Please advise me on any settings changes I should try to get at least some data. Having the crawler fail over and over without clearer reasons why is not that helpful for me. :) I can link you to the runs I let go longer if that is helpful (here is one that ran for 32 minutes with no output before I aborted it): https://console.apify.com/view/runs/VgAjVAU3nHtBoVdeg. I also tried changing the proxy from datacenter to residential between runs, and that did not help. Thank you!

janbuchar

Hello, and thank you for your interest in Website Content Crawler!

I noticed that your input contains the following:

  "excludeUrlGlobs": [
    {
      "glob": ""
    }
  ],
  "includeUrlGlobs": [
    {
      "glob": "https://www.netgear.com/support/product/**"
    },
    {
      "glob": "https://kb.netgear.com/**"
    },
    {
      "glob": ""
    }
  ],

Is there some intent behind the empty globs or were they added by mistake? I suspect that this may cause the problem that you're describing.
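
For anyone who wants to rule this out before a run, here is a minimal sketch of stripping blank glob entries from an input object. The helper name `drop_empty_globs` is hypothetical and not part of any Apify library; it only assumes the input shape quoted above, where each entry is an object with a `glob` string.

```python
def drop_empty_globs(run_input: dict) -> dict:
    """Return a shallow copy of run_input with blank glob entries removed.

    Assumes each entry under includeUrlGlobs / excludeUrlGlobs is a dict
    with a "glob" key, as in the input quoted in this thread.
    """
    cleaned = dict(run_input)
    for key in ("includeUrlGlobs", "excludeUrlGlobs"):
        entries = run_input.get(key)
        if entries is None:
            continue  # key not present, nothing to clean
        # Keep only entries whose glob is a non-blank string.
        cleaned[key] = [e for e in entries if str(e.get("glob", "")).strip()]
    return cleaned
```

The original dictionary is left unmodified, so the cleaned and uncleaned inputs can be compared side by side when debugging a run.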

ollieiq

a month ago

Hello Jan, I do not think that is the issue. I just tried running the crawler again with no globs and the output is still 0. The logs show that navigation is timing out. Here is the run I just tried without any globs: https://console.apify.com/view/runs/JEQOKSOBF5fTWkR4K

2024-08-05T16:21:47.683Z ACTOR: Pulling Docker image of build FDzsyicgYoKxjvvel from repository.
2024-08-05T16:21:47.820Z ACTOR: Creating Docker container.
2024-08-05T16:21:48.179Z ACTOR: Starting Docker container.
2024-08-05T16:21:49.538Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2024-08-05T16:21:49.540Z Executing main command
2024-08-05T16:21:52.367Z INFO System info {"apifyVersion":"3.2.3","apifyClientVersion":"2.9.0","crawleeVersion":"3.11.1","osType":"Linux","nodeVersion":"v18.19.1"}
2024-08-05T16:21:53.753Z INFO AdaptiveCrawler: Starting the crawler.
2024-08-05T16:21:53.985Z INFO AdaptiveCrawler: Running browser request handler for https://www.netgear.com/support/
2024-08-05T16:21:54.003Z INFO HttpCrawler: Starting the crawler.
2024-08-05T16:22:53.753Z INFO AdaptiveCrawler:Statistics: AdaptiveCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60959,"retryHistogram":[]}
2024-08-05T16:22:53.786Z INFO AdaptiveCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2024-08-05T16:22:54.004Z INFO HttpCrawler:Statistics: HttpCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60363,"retryHistogram":[]}
2024-08-05T16:22:54.037Z INFO HttpCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2024-08-05T16:22:57.951Z WARN AdaptiveCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"nq2tVFaFuGjkuJn","url":"https://www.netgear.com/support/","retryCount":1}
2024-08-05T16:22:58.260Z INFO AdaptiveCrawler: Running browser request handler for https://www.netgear.com/support/
2024-08-05T16:23:53.753Z INFO AdaptiveCrawler:Statistics: AdaptiveCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120959,"retryHistogram":[]}
2024-08-05T16:23:53.787Z INFO AdaptiveCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2024-08-05T16:23:54.004Z INFO HttpCrawler:Statistics: HttpCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120363,"retryHistogram":[]}
2024-08-05T16:23:54.041Z INFO HttpCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2024-08-05T16:24:01.643Z WARN AdaptiveCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"nq2tVFaFuGjkuJn","url":"https://www.netgear.com/support/","retryCount":2}
2024-08-05T16:24:01.838Z INFO AdaptiveCrawler: Running browser request handler for https://www.netgear.com/support/
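
The relevant signal in this log is the repeated "Navigation timed out" warning. As a rough sketch for pulling those warnings out of a longer log dump, the snippet below scans the raw text with a regular expression. The function name `find_navigation_timeouts` is hypothetical, and the pattern assumes the exact message wording seen in this log, which may differ between crawler versions.

```python
import json
import re

# Matches the WARN entries seen in the log above. The trailing JSON object
# is assumed to be flat (no nested braces), as it is in this log.
TIMEOUT_WARNING = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z) WARN (?P<crawler>\w+): "
    r"Reclaiming failed request back to the list or queue\. "
    r"Navigation timed out after (?P<secs>\d+) seconds\. "
    r"(?P<meta>\{[^{}]*\})"
)


def find_navigation_timeouts(log_text: str) -> list[dict]:
    """Return one record per navigation-timeout warning found in log_text."""
    results = []
    for match in TIMEOUT_WARNING.finditer(log_text):
        meta = json.loads(match.group("meta"))
        results.append({
            "timestamp": match.group("ts"),
            "crawler": match.group("crawler"),
            "timeout_secs": int(match.group("secs")),
            "url": meta["url"],
            "retry_count": meta["retryCount"],
        })
    return results
```

Because the search runs over the whole text rather than line by line, it works whether the log has one entry per line or has been pasted as a single run of text.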

janbuchar

Interesting. I just reran the exact same configuration - https://console.apify.com/view/runs/lSXNipEmYYevCwKl6 - and I got a bunch of results (I aborted the run after like 7 requests). Does this happen for all your runs of Website Content Crawler?

ollieiq

a month ago

It does not happen for other sites I tried. I did increase the timeout threshold for the netgear site to 90 seconds and was able to get results. It is interesting you kept the timeout at 60 seconds and were able to get results. I don't know what could explain that other than the site might be busier when I try to access it vs. when you try to access it?
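
For reference, a minimal sketch of the change described here: raising the timeout to 90 seconds in the run input. It assumes the Actor's input field for this setting is named `requestTimeoutSecs`; please check the Actor's input schema before relying on that name. The helper `with_request_timeout` is hypothetical.

```python
def with_request_timeout(run_input: dict, timeout_secs: int) -> dict:
    """Return a copy of run_input with the request timeout overridden.

    "requestTimeoutSecs" is an assumed field name; verify it against
    the Actor's input schema.
    """
    if timeout_secs <= 0:
        raise ValueError("timeout_secs must be positive")
    updated = dict(run_input)
    updated["requestTimeoutSecs"] = timeout_secs
    return updated


# Example: the start URL from this thread with the 90-second timeout.
run_input = with_request_timeout(
    {"startUrls": [{"url": "https://www.netgear.com/support/"}]},
    90,
)
```

The resulting dictionary can be pasted into the Actor's JSON input editor or passed to whichever client is used to start the run.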

janbuchar

I guess it might be that, even though it sounds suspicious.

ollieiq

a month ago

As it appears to be working now, I will close the issue and open a different one if a similar problem occurs again.

Developer
Maintained by Apify
Actor metrics
  • 3k monthly users
  • 465 stars
  • 99.9% runs succeeded
  • 3.1 days response time
  • Created in Mar 2023
  • Modified 10 days ago