Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Is it possible to download RSS feeds, which are in XML, with this Actor?
I've tried it and the run succeeded, but no file was downloaded.
The settings I've used are:

    {
        "startUrls": [{ "url": "https://example.com/rss-feed" }],
        "maxCrawlDepth": 0,
        "maxCrawlPages": 1,
        "maxRequestRetries": 3,
        "crawlerType": "cheerio",
        "saveHtmlAsFile": true,
        "saveHtml": false,
        "saveFiles": false,
        "saveMarkdown": false,
        "saveScreenshots": false,
        "useSitemaps": false,
        "removeCookieWarnings": true
    }
In the Key-Value Store, there are only 4 keys: CRAWLEE_STATE, INPUT, SDK_CRAWLER_STATISTICS_0, and SDK_SESSION_POOL_STATE.
Thanks.
Hello, and thank you for your interest in this Actor!
Website Content Crawler was made to "crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines" (see the Actor description in the README).
If all you need is to download (and parse) the RSS feed, then you can use one of the existing Actors on Apify Store (search for RSS).
If those don't fulfill your needs, you can also use Cheerio Scraper to parse the RSS feed manually; see my example run, and feel free to reuse its input if you like.
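To give a rough idea of what "parsing the RSS feed manually" involves: a Cheerio Scraper pageFunction receives the parsed document and pulls out each `<item>`'s title and link. The sketch below stands in for that logic with a naive hand-rolled extraction over a made-up sample feed, so it runs self-contained; a real pageFunction would use the cheerio `$` handle instead, and the feed content and URLs here are purely illustrative.

```javascript
// Made-up sample RSS feed; a real run would fetch this from the feed URL.
const sampleFeed = `
<rss version="2.0">
  <channel>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>`;

// Extract { title, link } for each <item>. A naive regex is used here only
// to keep the sketch dependency-free; cheerio's selectors are more robust.
function extractItems(xml) {
  const items = [];
  const itemRe = /<item>([\s\S]*?)<\/item>/g;
  let match;
  while ((match = itemRe.exec(xml)) !== null) {
    const inner = match[1];
    const title = (inner.match(/<title>([\s\S]*?)<\/title>/) || [])[1];
    const link = (inner.match(/<link>([\s\S]*?)<\/link>/) || [])[1];
    items.push({ title, link });
  }
  return items;
}

console.log(extractItems(sampleFeed));
```

In an actual Cheerio Scraper run, the equivalent would be returning `$('item').map(...)` results from the pageFunction, which then land in the run's dataset.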
Both of these solutions will also be much faster than anything you could achieve with the Website Content Crawler.
I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!
- 3.8k monthly users
- 616 stars
- 99.9% runs succeeded
- 3.4 days response time
- Created in Mar 2023
- Modified 3 days ago