Smart Article Extractor

Pricing

Pay per usage

Developed by

Lukáš Křivka

Maintained by Apify

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.

4.7 (6)

Issues response

30 days

Last modified

6 months ago

News

2024-03-21

Features

Add navigationWaitUntil input option for browser to allow faster or slower loading depending on the use-case

2023-09-12

Features

Add maxArticlesPerStartUrl to input to limit the number of articles per start URL

2023-08-03

Features

Add onlyArticlesForLastDays to input for easier dynamic date filtering
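
To illustrate what a rolling date filter of this kind does, here is a minimal Python sketch. It is not the actor's implementation: only the option name onlyArticlesForLastDays comes from the changelog, while the helper name is_recent, the "date" key, and the ISO 8601 date format are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_recent(article, only_articles_for_last_days, now=None):
    """Return True if the article's date falls within the last N days.

    Hypothetical helper: assumes article["date"] is an ISO 8601 string.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=only_articles_for_last_days)
    published = datetime.fromisoformat(article["date"])
    if published.tzinfo is None:
        # Treat naive timestamps as UTC so the comparison is well defined.
        published = published.replace(tzinfo=timezone.utc)
    return published >= cutoff


articles = [
    {"url": "https://example.com/a", "date": "2023-08-01T10:00:00+00:00"},
    {"url": "https://example.com/b", "date": "2023-07-01T10:00:00+00:00"},
]
# A fixed "now" keeps the example reproducible.
fixed_now = datetime(2023, 8, 3, tzinfo=timezone.utc)
recent = [a["url"] for a in articles if is_recent(a, 7, now=fixed_now)]
print(recent)
```

Because the cutoff is computed relative to the current time on every run, the same setting keeps working for scheduled runs without editing a fixed date.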

2023-03-27

Changes

snapshotUrls output has been replaced by screenshotUrl
extendOutputFunction is run after all fields were assigned for full control

Fixes

extendOutputFunction now correctly works with undefined fields for browser

2023-03-20

Features

Add crawlWholeSubdomain to input so you don't need to set pseudoUrls or linkSelector
Add onlySubdomainArticles to input to limit articles and enqueueing to the subdomain of the start URL
Add saveHtmlAsLink to input to save HTML of articles as a link in the output
Add referrer, startUrl and depth to output
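
As a rough picture of what limiting a crawl to the subdomain of the start URL involves, the Python sketch below compares hostnames. This is an assumption about the behavior, not the actor's code: the changelog does not define how crawlWholeSubdomain or onlySubdomainArticles match URLs, and the helper same_subdomain is hypothetical.

```python
from urllib.parse import urlparse


def same_subdomain(start_url, candidate_url):
    """Hypothetical check: treat 'same subdomain' as an identical hostname."""
    start_host = urlparse(start_url).hostname
    candidate_host = urlparse(candidate_url).hostname
    return start_host is not None and start_host == candidate_host


start = "https://blog.example.com/"
links = [
    "https://blog.example.com/2023/03/post",
    "https://shop.example.com/item/1",
    "https://other.org/news",
]
# Keep only links that live on the same host as the start URL.
to_enqueue = [u for u in links if same_subdomain(start, u)]
print(to_enqueue)
```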

2023-03-01

Features

Update SDK to version 3

2022-10-13

Features

Deprecate saveSnapshotsOfInvalidArticles input field in favor of the new saveSnapshots input field, which saves snapshots for all articles.
Deprecate pageWaitSelector and instead add pageWaitSelectorCategory and pageWaitSelectorArticle inputs

2022-09-29

Features

Added infinite scroll feature for browsers with 3 inputs: scrollToBottom, scrollToBottomButtonSelector, scrollToBottomMaxSecs

2022-09-21

Features

Nicer messages explaining why an article was marked as invalid
Added saveSnapshotsOfInvalidArticles option to input

2021-06-17

Features

Added enqueueFromArticles option to enqueue articles from article pages to get even more articles from the website. You need to enable it in input.
Added scanSitemaps and sitemapUrls parameters. scanSitemaps automatically searches sitemaps for articles for each start URL, and sitemapUrls allows you to add the sitemaps manually if necessary. Be careful: scanSitemaps may add a huge number of (sometimes old) article URLs to the scraping process.
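
For readers unfamiliar with sitemaps, the following Python sketch shows how article URLs can be read from a sitemap document that follows the sitemaps.org protocol. It only illustrates the general idea; the function urls_from_sitemap and the sample XML are made up for the example and say nothing about how the actor itself processes sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> values of all <url> entries in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/one</loc><lastmod>2021-06-01</lastmod></url>
  <url><loc>https://example.com/news/two</loc><lastmod>2015-01-01</lastmod></url>
</urlset>"""

found = urls_from_sitemap(sample)
print(found)
```

The second entry in the sample has an old lastmod value, which is the kind of URL that can inflate a run when every sitemap entry is enqueued.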

2021-03-12

Fixes

onlyNewArticles and onlyNewArticlesPerDomain were loading duplicate items, which caused excess usage of dataset reads.

2021-03-31

Features

Added new input option onlyNewArticlesPerDomain. This is a much more efficient way to deduplicate articles, so use it instead of onlyNewArticles.
onlyNewArticlesPerDomain also works on local datasets
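
The changelog does not describe how per-domain deduplication is implemented. As a hedged illustration of the general idea, the Python sketch below groups already seen article URLs by domain, so a lookup only touches the entries for that domain. The class PerDomainDeduplicator and its in-memory storage are assumptions; the actor presumably persists this state between runs in some other way.

```python
from collections import defaultdict
from urllib.parse import urlparse


class PerDomainDeduplicator:
    """Hypothetical sketch: remember seen article URLs grouped by domain."""

    def __init__(self):
        # Maps a hostname to the set of article URLs already seen on it.
        self._seen = defaultdict(set)

    def is_new(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        domain = urlparse(url).hostname or ""
        if url in self._seen[domain]:
            return False
        self._seen[domain].add(url)
        return True


dedup = PerDomainDeduplicator()
results = [
    dedup.is_new("https://example.com/a"),
    dedup.is_new("https://example.com/a"),
    dedup.is_new("https://example.org/a"),
]
print(results)
```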

2021-01-21

Fix: Now works with Start URLs from a public spreadsheet

2020-09-28

Upgraded away from Apify version 0.21.0, which sometimes crashed at the start of the run
Added currentItem param to extendOutputFunction
Improved logs
Increased request timeouts to work better on very slow sites

2020-07-07

Added option to run with browser (Puppeteer)
Added option to wait for page load or for selector (browser only)
Added articleUrls as an input option to parse article pages directly
