Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
I am triggering my run using the API:
```json
{
  "startUrls": [
    { "url": "https://www.bafa.de/DE/Wirtschaft/Handwerk_Industrie/Innovativer_Schiffbau/innovativer_schiffbau_node.html" }
  ],
  "maxCrawlDepth": 1,
  "useSitemaps": false,
  "saveFiles": true,
  "includeUrlGlobs": ["*.pdf"]
}
```
The URL has two links leading to PDF documents, which are not being recognized. At least I don't see the links leading to the PDFs when looking at the results table. Nor do I recognise any files being downloaded. What could be the issue?
And: does Apify have a native method of indexing the contents of these PDFs?
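For reference, a run with an input like the one above can be started by POSTing it to the Actor's run endpoint of the Apify REST API (a sketch; `$APIFY_TOKEN` stands in for your API token):

```shell
# Start a Website Content Crawler run with the input from the question.
# $APIFY_TOKEN is a placeholder for your Apify API token.
curl -X POST \
  "https://api.apify.com/v2/acts/apify~website-content-crawler/runs?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://www.bafa.de/DE/Wirtschaft/Handwerk_Industrie/Innovativer_Schiffbau/innovativer_schiffbau_node.html" }],
    "maxCrawlDepth": 1,
    "useSitemaps": false,
    "saveFiles": true,
    "includeUrlGlobs": ["*.pdf"]
  }'
```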
Hi, thank you for using Website Content Crawler.
To address your case, you need to update `includeUrlGlobs` to:

```json
"includeUrlGlobs": [
  {
    "glob": "**/*.pdf**"
  }
]
```
This configuration instructs the crawler to also include PDF files.
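A quick way to see why the original `*.pdf` glob never matches is to emulate minimatch-style semantics, where a single `*` does not cross `/` but `**` does. This is a rough sketch, not the crawler's actual matcher: the helper only handles `*` and `**`, and the PDF URL below is a made-up example.

```python
import re

def glob_to_regex(glob: str) -> re.Pattern:
    """Rough minimatch-style conversion: '**' crosses '/', a single '*' does not."""
    out = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            out.append(".*")       # '**' matches across path separators
            i += 2
        elif glob[i] == "*":
            out.append("[^/]*")    # '*' stops at '/'
            i += 1
        else:
            out.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(out))

# Hypothetical PDF URL for illustration.
url = "https://www.bafa.de/SharedDocs/Downloads/DE/beispiel.pdf"

print(bool(glob_to_regex("*.pdf").fullmatch(url)))       # False: '*' cannot span the '/' segments
print(bool(glob_to_regex("**/*.pdf**").fullmatch(url)))  # True
```

Because a full URL always contains slashes, `*.pdf` can never match it under these semantics, while `**/*.pdf**` can.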
Unfortunately, it’s not working for this specific website. I’m unable to determine the cause at the moment.
I’ll keep this issue open, and we’ll try to investigate further. However, it might take some time before we can revisit this issue.

Jiri
Hi, I apologize for the delayed response. We have a fix ready, and it should be released later this week or early next week.