📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.
Add navigationWaitUntil input option for browser to allow faster or slower loading depending on the use-case
2023-09-12
Features
Add maxArticlesPerStartUrl to input to limit the number of articles per start URL
2023-08-03
Features
Add onlyArticlesForLastDays to input for easier dynamic date filtering
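To illustrate what a relative date filter such as onlyArticlesForLastDays involves, here is a minimal Python sketch. The helper name `is_within_last_days` and its parameters are hypothetical. This changelog does not show how the actor implements the filter, so its behavior may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_within_last_days(published: datetime, days: int,
                        now: Optional[datetime] = None) -> bool:
    """Return True if `published` is no older than `days` days before `now`.

    Hypothetical helper for illustration only; not the actor's code.
    Both datetimes are assumed to be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # The cutoff moves with `now`, so the same input keeps working on every run.
    cutoff = now - timedelta(days=days)
    return published >= cutoff
```

Because the cutoff is computed at run time, a scheduled run does not need its date input edited before each execution.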
2023-03-27
Changes
snapshotUrls output has been replaced by screenshotUrl
extendOutputFunction is run after all fields were assigned for full control
Fixes
extendOutputFunction now correctly works with undefined fields for browser
2023-03-20
Features
Add crawlWholeSubdomain to input so you don't need to set pseudoUrls or linkSelector
Add onlySubdomainArticles to input to limit articles and enqueueing to the subdomain of the start URL
Add saveHtmlAsLink to input to save HTML of articles as a link in the output
Add referrer, startUrl and depth to output
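The idea of limiting articles to the subdomain of the start URL can be sketched as a hostname comparison. The function `same_subdomain` below is hypothetical, and treating "subdomain" as the exact hostname is an assumption. The actor's own matching rules are not described in this changelog.

```python
from urllib.parse import urlparse


def same_subdomain(start_url: str, candidate_url: str) -> bool:
    """Return True if both URLs share exactly the same hostname.

    Hypothetical helper: "subdomain" is interpreted here as the full
    hostname (e.g. blog.example.com), which is an assumption.
    """
    start_host = urlparse(start_url).hostname
    candidate_host = urlparse(candidate_url).hostname
    # urlparse lowercases hostnames, so the comparison is case-insensitive.
    return start_host is not None and start_host == candidate_host
```

Under this interpretation, a start URL on blog.example.com would accept links to blog.example.com but reject links to www.example.com.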
2023-03-01
Features
Update SDK to version 3
2022-10-13
Features
Deprecate saveSnapshotsOfInvalidArticles input field in favor of new saveSnapshots input field that saves snapshots for all articles.
Deprecate pageWaitSelector and instead add pageWaitSelectorCategory and pageWaitSelectorArticle inputs
2022-09-29
Features
Added infinite scroll feature for browsers with 3 inputs: scrollToBottom, scrollToBottomButtonSelector, scrollToBottomMaxSecs
2022-09-21
Features
Nicer messages explaining why an article was marked as invalid
Added saveSnapshotsOfInvalidArticles option to input
2021-06-17
Features
Added enqueueFromArticles option to enqueue articles from article pages to get even more articles from the website. You need to enable it in input.
Added scanSitemaps and sitemapUrls parameters. scanSitemaps automatically searches sitemaps for articles for each start URL, and sitemapUrls allows you to add the sitemaps manually if necessary. Be careful: scanSitemaps may dump a huge number of (sometimes old) article URLs into the scraping process.
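For readers unfamiliar with sitemaps, the sketch below shows how URLs can be pulled from a sitemap document that follows the sitemaps.org 0.9 schema. The function name `extract_sitemap_urls` is hypothetical, and this is not the actor's implementation. It only illustrates why a sitemap can yield a large number of URLs at once.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> value found in a sitemap XML string.

    Hypothetical helper for illustration. It matches <loc> elements in
    both <urlset> files and <sitemapindex> files.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

Every URL listed in the sitemap is returned, regardless of how old the page is, which is why combining sitemap scanning with a date filter may be worthwhile.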
2021-03-12
Fixes
onlyNewArticles and onlyNewArticlesPerDomain were loading duplicate items, which caused excessive dataset reads.
2021-03-31
Features
Added new input option onlyNewArticlesPerDomain. This is a much more efficient way to deduplicate articles, so use it instead of onlyNewArticles.
onlyNewArticlesPerDomain also works on local datasets
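As a rough illustration of per-domain deduplication, the sketch below keeps a set of already-seen article URLs for each domain and returns only the URLs that are new. The function `filter_new_articles` and the in-memory `seen_by_domain` mapping are hypothetical. How the actor persists this state between runs is not described in this changelog.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Set
from urllib.parse import urlparse


def filter_new_articles(urls: Iterable[str],
                        seen_by_domain: DefaultDict[str, Set[str]]) -> List[str]:
    """Return URLs not seen before, recording them under their domain.

    Hypothetical helper for illustration only; not the actor's code.
    """
    new_urls = []
    for url in urls:
        # Group state by hostname so each domain has its own seen-set.
        domain = urlparse(url).hostname or ""
        if url not in seen_by_domain[domain]:
            seen_by_domain[domain].add(url)
            new_urls.append(url)
    return new_urls
```

A short usage example under the same assumptions:

```python
seen = defaultdict(set)
first_run = filter_new_articles(["https://a.com/1", "https://b.com/1"], seen)
second_run = filter_new_articles(["https://a.com/1", "https://a.com/2"], seen)
# first_run keeps both URLs; second_run keeps only https://a.com/2
```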
2021-01-21
Fix: Now works with Start URLs from a public spreadsheet
2020-09-28
Upgraded from Apify version 0.21.0, which sometimes crashed at the start of the run
Added currentItem param to extendOutputFunction
Improved logs
Increased request timeouts to work better on very slow sites
2020-07-07
Added option to run with browser (Puppeteer)
Added option to wait for page load or for selector (browser only)
Added articleUrls as an input option to parse articles directly