Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Changelog

All notable changes to this project will be documented in this file.

0.3.58

🚀 Features

  • Push latest to Apify with GH actions, add auto changelog bumps (#347)
  • More lenient behaviour on missing content-type response header (#353)
  • Improve antiblocking performance (#354)

0.3.57 (2024-12-13)

This release only contains documentation changes in readme and input schema.

0.3.56 (2024-11-25)

  • Input:
    • Empty includeUrlGlobs are now filtered out with a warning log message. To enforce the old behavior (i.e. matching everything), use ** instead.
  • Behaviour:
    • The Actor now automatically drops the request queue associated with the file download.
    • Only the first <title> element on the page is prepended to the exported content.
    • The crawler now uses the correct scope with all startUrls being sitemaps.
    • Sitemaps are now processed in a separate thread. The 30 seconds per sitemap request limit is now strongly enforced.
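
The empty-glob handling described for includeUrlGlobs could look roughly like the sketch below. It is an illustration only, not the Actor's code: filter_empty_globs is a hypothetical helper, and the globs are assumed to be plain strings.

```python
import logging

logger = logging.getLogger("crawler-sketch")


def filter_empty_globs(include_url_globs):
    """Drop empty or whitespace-only globs and log a warning (hypothetical helper)."""
    kept = [glob for glob in include_url_globs if glob and glob.strip()]
    dropped = len(include_url_globs) - len(kept)
    if dropped:
        logger.warning(
            "Ignoring %d empty includeUrlGlobs entries; use ** to match everything.",
            dropped,
        )
    return kept
```

Passing "**" explicitly survives the filter, which matches the changelog's advice for restoring the old match-everything behavior.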

0.3.55 (2024-11-11)

  • Behaviour:
    • The sitemap timeout warning is only logged the first time.

0.3.54 (2024-11-07)

  • Input:
    • The default removeElementsCssSelector now removes img elements with data: URLs to prevent cluttering the text output.
  • Behaviour:
    • expandIframes now skips broken iframe elements instead of failing the whole request.
    • Actor now parses formatted (indented or newline-separated) sitemaps correctly.
    • The sitemap discovery process is now parallelized. Logging is improved to show the progress of the sitemap discovery.
    • Sitemap processing now has stricter per-URL time limits to prevent indefinite hangs.

0.3.53 (2024-10-22)

  • Input:
    • New keepElementsCssSelector accepts a CSS selector targeting elements to extract from the page to the output.
  • Behaviour:
    • Actor optimizes RQ writes by following the maxCrawlPages limits better.

0.3.52 (2024-10-10)

  • Behavior:
    • Handle sitemap-based requests correctly. Solves the indefinite RequestQueue hanging read issue.

0.3.51 (2024-10-10)

  • Behavior:
    • Revert internal library update to mitigate the indefinite RequestQueue hanging read issue.

0.3.50 (2024-10-07)

  • Behavior:
    • Actor terminates sitemap loading prematurely in case of a timeout.
    • Sitemap loading now respects maxRequestRetries option.

0.3.49 (2024-09-23)

  • Behavior:
    • Use the correct proxy settings when loading sitemap files.
    • Mitigate sitemap persistence issues with premature stopping on maxRequests (ERR_STREAM_PUSH_AFTER_EOF).

0.3.48 (2024-09-10)

  • Input:
    • useSitemaps option is now pre-filled to true to automatically enable it for new users and in API examples.

0.3.47 (2024-09-04)

  • Behaviour:
    • Use Crawlee 3.11.3, which should help with the crawler getting stuck because of stale requests in the queue.

0.3.46 (2024-08-30)

  • Behaviour:
    • Process markdown in a separate worker thread so it won't block the main process on very large pages.
    • Sitemap loading with useSitemaps doesn't block indefinitely on large sitemaps anymore.

0.3.45 (2024-08-20)

  • Input:
    • New keepFragmentUrls (URL #fragments identify unique pages) input option to consider fragment URLs as separate pages.
  • Behaviour:
    • Ensure canonical URL is only taken from the main page and not embedded pages from <iframe>.
    • Dataset items that were too large used to fail. Now we retry with a trimmed payload (the html, text and markdown fields are each trimmed to their first three million characters).
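
The trimming step for oversized dataset items can be sketched as follows. This is a minimal illustration under assumptions: trim_item is a hypothetical helper name, and only the field names and the three-million-character limit come from the changelog entry.

```python
MAX_FIELD_CHARS = 3_000_000  # limit stated in the changelog entry
TRIMMED_FIELDS = ("html", "text", "markdown")  # fields named in the changelog entry


def trim_item(item):
    """Return a copy of the item with oversized text fields cut to the limit."""
    trimmed = dict(item)  # shallow copy so the original item is left untouched
    for field in TRIMMED_FIELDS:
        value = trimmed.get(field)
        if isinstance(value, str) and len(value) > MAX_FIELD_CHARS:
            trimmed[field] = value[:MAX_FIELD_CHARS]
    return trimmed
```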

0.3.44 (2024-07-30)

  • Behaviour:
    • waitForSelector option allows users to specify a CSS selector to wait for before extracting the page content. This is useful for pages that load content dynamically and break the automatic waiting mechanism.

0.3.43 (2024-07-24)

  • Behaviour:
    • Change the shadow DOM expansion logic to handle edge cases better.

0.3.42 (2024-07-12)

  • Behaviour:
    • Fix edge cases with the improved startUrls sanitization.

0.3.41 (2024-07-11)

  • Behaviour:
    • Better input URLs sanitization to prevent issues with the startUrls input.

0.3.40 (2024-07-10)

  • Input:
    • New expandIframes (Expand iframe elements) option for extracting content from on-page iframe elements. Available only in playwright:firefox.

0.3.39 (2024-06-28)

  • Behaviour:
    • Mitigating excessive Request Queue writes in some cases.

0.3.38 (2024-06-25)

  • Behaviour:
    • The saveScreenshots option now correctly prints warnings with crawler types that don't support screenshots.
    • The screenshot KVS key now contains the website hostname and a hash of the original URL to avoid collisions.
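
One way to build a collision-resistant key from the hostname and a hash of the original URL is sketched below. The key format, the SCREENSHOT prefix, the choice of SHA-256, and the screenshot_key name are all assumptions made for illustration; the changelog does not specify them.

```python
import hashlib
from urllib.parse import urlsplit


def screenshot_key(url: str) -> str:
    """Build a key from the hostname plus a short hash of the full URL (hypothetical format)."""
    hostname = urlsplit(url).hostname or "unknown-host"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"SCREENSHOT-{hostname}-{digest}"
```

Because the hash covers the whole URL, two pages on the same host get different keys, while the hostname keeps the keys readable.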

0.3.37 (2024-06-17)

  • Behaviour:
    • The Actor now respects the advanced request configuration passed through the Start URLs input.

0.3.36 (2024-06-10)

  • Input:
    • New saveHtmlAsFile option stores the page HTML in a key-value store and replaces the HTML value in the dataset with a link to the stored file. This keeps dataset items smaller, since there is a hard limit on their size.
    • Deprecated saveHtml in favor of saveHtmlAsFile.
  • Output:
    • The new saveHtmlAsFile option saves the URL under a new htmlUrl key in the dataset.
  • Behaviour:
    • HTML processors don't block the main thread and can safely time out.
    • Better fallback logic for the HTML processing pipeline.
    • When pushing to the dataset, we now detect a too-large payload and skip retrying (while suggesting the new saveHtmlAsFile option to get around this problem).

0.3.35 (2024-05-23)

  • Behaviour:
    • RequestQueue race condition fixed.
  • Output:
    • The Readable text extractor now correctly handles the article titles.

0.3.34 (2024-05-17)

  • Behaviour:
    • If any of the Start URLs lead to a sitemap file, it is processed and the links are enqueued.
    • Performance / QoL improvements (see Crawlee 3.10.0 changelog for more details)

0.3.33 (2024-04-22)

  • Input:
    • AdaptiveCrawler is the new default (prefill) crawler type.
  • Behaviour:
    • The use of Chrome + Chromium browsers was deprecated. The Actor now uses only Firefox internally for the browser-based crawlers.
    • Reimplementation of the file download feature for better stability and performance.
    • On smaller websites, the AdaptiveCrawler skips the adaptive scanning to speed up the crawl.

0.3.32 (2024-03-28)

  • Output:
    • The Actor now stores metadata.headers with the HTTP response headers of the crawled page.

0.3.31 (2024-03-14)

  • Input:
    • Invalid startUrls are now filtered out and don't cause the actor to fail.

0.3.30 (2024-02-24)

  • Behavior:
    • File download now respects excludeGlobs and includeGlobs input options, stores filenames, and understands Content-Disposition HTTP headers (i.e. "forced download").
  • Output:
    • Better og: metatags coverage (article:, movie: etc.).
    • Stores JSON-LD metatags in the metadata field.

0.3.29 (2024-02-05)

  • Input:
    • useSitemaps toggle for sitemap discovery. This leads to more consistent results and also scrapes webpages that are otherwise unreachable.
    • Do not fail on empty globs in input (ignore them instead).
    • Experimental playwright:adaptive crawling mode
  • Output:
    • Add metadata.openGraph output field for contents of og:* metatags.

0.3.27 (2024-01-25)

  • Input:
    • maxRequestRetries input option for limiting the number of request retries on network, server or parsing errors.
  • Behavior:
    • Allow large lists of start URLs with deep crawling (maxCrawlDepth > 0), as the memory overflow issue from 0.3.18 is now fixed.

0.3.26 (2024-01-02)

  • Output:
    • The Actor now stores metadata.mimeType for downloaded files (only applicable when saveFiles is enabled).

0.3.25 (2023-12-21)

  • Input:
    • Add maxSessionRotations input option for limiting the number of session rotations when recognized as a bot.
  • Behavior:
    • Fail on 401, 403 and 429 HTTP status codes.

0.3.24 (2023-12-07)

  • Behavior:
    • Fix a bug within the expandClickableElements utility function.

0.3.23 (2023-12-06)

  • Behavior:
    • Respect empty <body> tag when extracting text from HTML.
    • Fix a bug with the simplifiedBody === null exception.

0.3.22 (2023-12-04)

  • Input:
    • ignoreCanonicalUrl toggle to deduplicate pages based on their actual URLs (useful when two different pages share the same canonical URL).
  • Behavior:
    • Improve the large content detection - this fixes a regression from 0.3.21.

0.3.21 (2023-11-29)

  • Output:
    • The debug mode now stores the results of all the extractors (+ raw HTML) as Key-Value Store objects.
    • New extractor "Readable text with fallback" checks the results of the "Readable text" extractor and checks the content integrity on the fly.
  • Behavior:
    • Skip text-cleaning and markdown processing step on large responses to avoid indefinite hangs.

0.3.20 (2023-11-08)

  • Output:
    • The debug mode now stores the raw page HTML (without the page preprocessing) under the rawHtml key.

0.3.19 (2023-10-18)

  • Input:
    • Add a default for proxyConfiguration option (which is now required since 0.3.18). This fixes the actor usage via API, falling back to the default proxy settings if they are not explicitly provided.

0.3.18 (2023-10-17)

  • Input:
    • Adds includeUrlGlobs option to allow explicit control over enqueuing logic (overrides the default scoping logic).
    • Adds requestTimeoutSecs option to allow overriding the default request processing timeout.
  • Behavior:
    • Disallow using a large list of start URLs (more than 100) with deep crawling (maxCrawlDepth > 0) as it can lead to memory overflow.
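
The restriction introduced in this version can be expressed as a simple input check. The sketch below is illustrative only; validate_start_urls is a hypothetical name, and the real Actor may report the problem differently. Per the 0.3.27 entry, the restriction was later lifted.

```python
MAX_START_URLS_FOR_DEEP_CRAWL = 100  # threshold stated in the changelog entry


def validate_start_urls(start_urls, max_crawl_depth):
    """Reject large start URL lists when deep crawling is enabled (hypothetical check)."""
    if max_crawl_depth > 0 and len(start_urls) > MAX_START_URLS_FOR_DEEP_CRAWL:
        raise ValueError(
            f"{len(start_urls)} start URLs with maxCrawlDepth > 0 is not allowed; "
            f"use at most {MAX_START_URLS_FOR_DEEP_CRAWL} or set maxCrawlDepth to 0."
        )
    return start_urls
```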

0.3.17 (2023-10-05)

  • Input:
    • Adds debugLog option to enable debug logging.

0.3.16 (2023-09-06)

  • Behavior:
    • Raw HTTP Client (Cheerio) now works correctly with proxies again.

0.3.15 (2023-08-30)

  • Input:
    • startUrls is now a required input field.
    • Input tooltips now provide more detailed descriptions of crawler types and other input options.

0.3.14 (2023-07-19)

  • Behavior:
    • When using the Cheerio based crawlers, the actor processes links from removed elements correctly now.
    • Crawlers now follow links in <link> tags (rel=next, prev, help, search).
    • Relative canonical URLs are now getting correctly resolved during the deduplication phase.
    • Actor now automatically recognizes blocked websites and retries the crawl with a new proxy / fingerprint combination.
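
Following <link> tags and resolving relative canonical URLs can be sketched with the Python standard library as shown below. The LinkTagCollector class is hypothetical and only illustrates the idea; the Actor itself uses its own crawling stack.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FOLLOWED_RELS = {"next", "prev", "help", "search"}  # rel values listed in the changelog


class LinkTagCollector(HTMLParser):
    """Collect followable <link> targets and the canonical URL as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.followed = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        rels = set((attr.get("rel") or "").lower().split())
        absolute = urljoin(self.base_url, href)  # resolves relative hrefs
        if rels & FOLLOWED_RELS:
            self.followed.append(absolute)
        if "canonical" in rels and self.canonical is None:
            self.canonical = absolute
```

Resolving the canonical URL against the page URL before comparing it means that href="/docs/" and the absolute form of the same address deduplicate to one page.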

0.3.13 (2023-06-30)

  • Input:
    • The Actor now uses new defaults for input settings for better user experience:
      • default crawlerType is now playwright:firefox
      • saveMarkdown is true
      • removeCookieWarnings is true

0.3.12 (2023-06-14)

  • Input:
    • add excludeUrlGlobs for skipping certain URLs when enqueueing
    • add maxScrollHeightPixels for scrolling down the page (useful for dynamically loaded pages)
  • Behavior:
    • by default, the crawler scrolls down on every page to trigger dynamic loading (disable by setting maxScrollHeightPixels to 0)
    • the crawler now handles HTML processing errors gracefully
    • the actor now stays alive and restarts the crawl on certain known errors (Playwright Assertion Error).
  • Output:
    • crawl.httpStatusCode now contains HTTP response status code.

0.3.10 (2023-06-05)

  • Input:
    • move processedHtml under debug object
    • add removeCookieWarnings option to automatically remove cookie consent modals (via I don't care about cookies browser extension)
  • Behavior:
    • consider redirected start URL for prefix matching
    • make URL deduplication case-insensitive
    • wait at least 3s in playwright to let dynamic content load
    • retry the start URLs 10 times and regular URLs 5 times (to get around issues with retries on burnt proxies)
    • ignore links not starting with http
    • skip parsing non-html files
    • support startUrls as text-file
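
Two of the behaviors above, ignoring links that do not start with http and deduplicating URLs case-insensitively, can be combined in a small sketch. The should_enqueue helper is hypothetical and lowercases the whole URL as the simplest reading of "case-insensitive"; the Actor's actual normalization may differ.

```python
def should_enqueue(url, seen):
    """Return True if the URL is an http(s) link that has not been seen before."""
    key = url.lower()  # case-insensitive deduplication key
    if not key.startswith("http"):
        return False  # e.g. mailto:, javascript:, tel:
    if key in seen:
        return False
    seen.add(key)
    return True
```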

0.3.9 (2023-05-18)

  • Input:
    • Updated README and input hints.

0.3.8 (2023-05-17)

  • Input:
    • initialCookies option for passing cookies to the crawler. Provide the cookies as a JSON array of objects with "name" and "value" keys. Example: [{ "name": "token", "value": "123456" }].
  • Behavior:
    • textExtractor option is now removed in favour of htmlTransformer
    • unfluff extractor has been completely removed
    • HTML is always simplified by removing some elements from it, those are configurable via removeElementsCssSelector option, which now defaults to a larger set of elements, including <nav>, <footer>, <svg>, and elements with role attribute set to one of alert, banner, dialog, alertdialog.
    • New htmlTransformer option has been introduced, which allows configuring how the simplified HTML is further processed. Its output is still HTML, which can later be used to generate markdown or plain text.
    • The crawler will now try to expand collapsible sections automatically. This works by clicking on elements with aria-expanded="false" attribute. You can configure this selector via clickElementsCssSelector option.
    • When using Playwright-based crawlers, the crawler now waits for dynamic content based on network activity rather than webpage changes. This should improve the reliability of the crawler.
    • Firefox SEC_ERROR_UNKNOWN_ISSUER has been solved by preloading the recognized intermediate TLS certificates into the Docker image.
    • Crawled URLs are retried if their processing times out.
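
As an illustration of the initialCookies format introduced in this version, the sketch below assembles an input object and checks the cookie entries. The validate_initial_cookies helper is hypothetical, and the shape of startUrls (a list of objects with a "url" key) is an assumption not stated in this changelog.

```python
import json


def validate_initial_cookies(cookies):
    """Check that every cookie is an object with "name" and "value" keys."""
    for cookie in cookies:
        if not isinstance(cookie, dict) or not all(k in cookie for k in ("name", "value")):
            raise ValueError(f"Invalid cookie entry: {cookie!r}")
    return cookies


run_input = {
    "startUrls": [{"url": "https://example.com"}],  # assumed shape
    "initialCookies": validate_initial_cookies([{"name": "token", "value": "123456"}]),
}
print(json.dumps(run_input, indent=2))
```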

0.3.7 (2023-05-10)

  • Behavior:
    • URLs with redirects are now enqueued based on the original (unredirected) URL. This should prevent the actor from skipping relevant pages which are hidden behind redirects.
    • The actor now considers all start URLs when enqueueing new links. This way, the user can specify multiple start URLs as a workaround for the actor skipping some relevant pages on the website.
    • Error with not enqueueing URLs with certain query parameters is now fixed.
  • Output:
    • The .url field now contains the main resource URL without the fragment (#) part.
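
Stripping the fragment from a URL, as described for the .url field, is a one-liner in the Python standard library. The main_resource_url name is hypothetical.

```python
from urllib.parse import urldefrag


def main_resource_url(url):
    """Return the URL without its fragment (#) part."""
    return urldefrag(url).url
```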

0.3.6 (2023-05-04)

  • Input:
    • Made the initialConcurrency option visible in the input editor.
    • Added aggressivePruning option. With this option set to true, the crawler will try to deduplicate the scraped content. This can be useful when the crawler is scraping a website with a lot of duplicate content (header menus, footers, etc.)
  • Behavior:
    • The actor now stays alive and restarts the crawl on certain known errors (Playwright Assertion Error).

0.3.4 (2023-05-04)

  • Input:
    • Added a new hidden option initialConcurrency. This option sets the initial number of web browsers or HTTP clients running in parallel during the actor run. Increasing this number can speed up the crawling process. Bear in mind this option is hidden and can be changed only by editing the actor input using the JSON editor.

0.3.3 (2023-04-28)

  • Input:
    • Added a new option maxResults to limit the total number of results. If used with maxCrawlPages, the crawler will stop when either of the limits is reached.

0.3.1 (2023-04-24)

  • Input:
    • Added an option to download linked document files from the page - saveFiles. This is useful for downloading pdf, docx, xlsx... files from the crawled pages. The files are saved to the default key-value store of the run and the links to the files are added to the dataset.
    • Added a new crawler - Stealthy web browser - that uses a Firefox browser with a stealthy profile. It is useful for crawling websites that block scraping.

0.0.13 (2023-04-18)

  • Input:
    • Added new textExtractor option readableText. It is generally very accurate and has a good ratio of coverage to noise. It extracts only the main article body (similar to unfluff) but can work for more complex pages.
    • Added readableTextCharThreshold option. This only applies to readableText extractor. It allows fine-tuning which part of the text should be focused on. That only matters for very complex pages where it is not obvious what should be extracted.
  • Output:
    • Added simplified output view Overview that has only url and text for quick output check
  • Behavior:
    • Domains starting with www. are now considered equal to ones without it. This means that the start URL https://apify.com can enqueue https://www.apify.com and vice versa.
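
The www. equivalence can be sketched as a hostname normalization step. The helper names below are hypothetical, and the real scoping logic likely considers more than the hostname.

```python
from urllib.parse import urlsplit


def normalized_host(url):
    """Lowercase the hostname and drop a leading "www." prefix."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_site(url_a, url_b):
    """Treat hosts that differ only by a leading "www." as the same site."""
    return normalized_host(url_a) == normalized_host(url_b)
```

Under this rule https://apify.com and https://www.apify.com match, while other subdomains such as docs.apify.com do not.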

0.0.10 (2023-04-05)

  • Input:
    • Added new crawlerType option jsdom for processing with JSDOM. It allows client-side script processing, trying to mimic the browser behavior in Node.js but with much better performance. This is still experimental and may crash on some particular pages.
    • Added dynamicContentWaitSecs option (defaults to 10s), which is the maximum waiting time for dynamic waiting.
  • Output (BREAKING CHANGE):
    • Renamed crawl.date to crawl.loadedTime
    • Moved crawl.screenshotUrl to top-level object
    • The markdown field was made visible
    • Renamed metadata.language to metadata.languageCode
    • Removed metadata.createdAt (for now)
    • Added metadata.keywords
  • Behavior:
    • Added waiting for dynamically rendered content (supported in Headless browser and JSDOM crawlers). The crawler checks every half a second for content changes. When there are no changes for 2 seconds, the crawler proceeds to extraction.
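
The waiting strategy described above (poll every half a second, proceed after 2 seconds without changes, give up after dynamicContentWaitSecs) can be sketched as a polling loop. This is an illustration under assumptions: wait_for_stable_content and its get_content callback are hypothetical, and the clock and sleep parameters exist only to make the sketch testable.

```python
import time


def wait_for_stable_content(
    get_content,
    max_wait_secs=10.0,  # dynamicContentWaitSecs default from the changelog
    poll_secs=0.5,       # "checks every half a second"
    stable_secs=2.0,     # "no changes for 2 seconds"
    clock=time.monotonic,
    sleep=time.sleep,
):
    """Return True once content stops changing, False if the maximum wait elapses."""
    start = clock()
    last = get_content()
    last_change = start
    while True:
        now = clock()
        if now - last_change >= stable_secs:
            return True  # content has been stable long enough
        if now - start >= max_wait_secs:
            return False  # still changing, stop waiting anyway
        sleep(poll_secs)
        current = get_content()
        if current != last:
            last = current
            last_change = clock()
```

In a real crawler, get_content might return the page's current HTML or text length; either way the caller proceeds to extraction once the function returns.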

0.0.7 (2023-03-30)

  • Input:
    • BREAKING CHANGE: Added textExtractor input option to choose how strictly to parse the content. Swapped the previous unfluff for CrawleeHtmlToText as default which in general will extract more text. We chose to output more text rather than less by default.
    • Added removeElementsCssSelector which allows passing extra CSS selectors to further strip down the HTML before it is converted to text. This can help fine-tuning. By default, the actor removes the page navigation bar, header, and footer.
  • Output:
    • Added markdown to output if saveMarkdown option is chosen
    • All extractor outputs + HTML as a link can be obtained if debugMode is set.
    • Added pageType to the output (only as debug for now), it will be fine-tuned in the future.
  • Behavior:
    • Added deduplication by canonicalUrl. E.g. if multiple different URLs point to the same canonical URL, the duplicates are skipped.
    • Skip pages that redirect outside the original start URLs domain.
    • Only run a single text extractor unless in debug mode. This improves performance.
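
Deduplication by canonical URL can be sketched as follows. The dedupe_by_canonical helper and its input shape (pairs of page URL and canonical URL) are assumptions for illustration; pages without a canonical URL fall back to their own URL.

```python
def dedupe_by_canonical(pages):
    """Keep the first page for each canonical URL; pages is a list of (url, canonical_url or None)."""
    seen = set()
    kept = []
    for url, canonical_url in pages:
        key = canonical_url or url  # fall back to the page URL
        if key in seen:
            continue  # another page already covered this canonical URL
        seen.add(key)
        kept.append(url)
    return kept
```
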
Developer
Maintained by Apify

Actor Metrics

  • Created in Mar 2023