
News Website Crawler & Article Extractor
2 hours trial then $20.00/month - No credit card required now

Scrape all articles from any news website. Extract full text, metadata, keywords, and summaries. Ideal for content analysis, research, and news aggregation.
News Source Crawler (Apify Actor)
Crawl an entire news website and extract clean, structured data from all its articles. Get article text, metadata, keywords, summaries, and more, ready for content analysis, market research, news aggregation, and SEO monitoring. No coding required!
Pricing
- $35/month for unlimited usage
- Includes all features and Apify platform benefits
- No additional costs or hidden fees
Features
- Full Website Crawl: Scrapes articles from a specified news source URL
- Comprehensive Article Extraction: Get full article text, publication date, author(s), and source URL
- SEO & Content Analysis: Extract keywords, meta descriptions, and automatically generated summaries
- Multimedia Extraction: Get links to the main image, all images, and embedded videos
- Language Support: Specify the article language
- Limit Articles: Set a maximum number of articles to scrape (optional)
- Proxy Support: Integrates with Apify Proxy for reliable scraping, or use your custom proxy
- Analysis-Ready Data (JSON): Structured data output, perfect for analysis and integration
- Error Handling: Articles that fail to scrape are still reported as dataset items, with scrapeSuccess set to false and a scrapeErrorMessage
Why Use This News Source Crawler?
This Actor is designed to efficiently extract data from entire news websites. It crawls all linked articles from a starting URL, making it ideal for:
- Large-Scale Data Collection: Quickly gather data from an entire news source
- Comprehensive Analysis: Analyze the content, trends, and SEO strategies of a website
- Automated News Feeds: Build custom news feeds with structured data
- Time Savings: Automate the process of collecting articles from a specific source
Data Output
The Actor pushes data to the dataset as it scrapes, so results are available in real time. Each item represents a single article (or an error) and contains the following fields:
- articleURL: The URL of the scraped article
- sourceURL: The base URL of the news source
- articleLanguage: The language of the article (e.g., "en", "es")
- articleTitle: The title of the article
- articleAuthors: A comma-separated list of the article's authors
- articlePublishDate: The publication date (ISO 8601 format), if available
- articleText: The full text content of the article
- articleTopImage: The URL of the main image
- articleAllImages: A comma-separated list of URLs for all images
- articleVideos: A comma-separated list of URLs for embedded videos
- articleKeywords: A comma-separated list of extracted keywords
- articleSummary: A concise summary of the article
- scrapedAt: The timestamp of when the article was scraped (ISO 8601)
- scrapeSuccess: true if scraped successfully, false otherwise
- articleMetaDescription: The meta description of the article
- articleMetaKeywords: A comma-separated list of the meta keywords
- scrapeErrorMessage: An error message if scrapeSuccess is false
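Since each dataset item is either an article or an error, a consumer will usually want to separate the two before analysis. The following is a minimal sketch of that step, assuming the items have already been downloaded as a list of dictionaries. The function name `split_results`, the sample URLs, and the error text are illustrative assumptions, not part of the Actor.

```python
from typing import Any


def split_results(
    items: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate successfully scraped articles from error records."""
    articles: list[dict[str, Any]] = []
    errors: list[dict[str, Any]] = []
    for item in items:
        # A missing or non-true flag is treated as a failure so that
        # incomplete rows never pass silently into the analysis set.
        if item.get("scrapeSuccess") is True:
            articles.append(item)
        else:
            errors.append(item)
    return articles, errors


# Hypothetical sample data shaped like the fields listed above.
sample_items = [
    {"articleURL": "https://www.example.com/news/article1", "scrapeSuccess": True},
    {
        "articleURL": "https://www.example.com/news/article2",
        "scrapeSuccess": False,
        "scrapeErrorMessage": "hypothetical error text",
    },
]

articles, errors = split_results(sample_items)
for err in errors:
    print(err["articleURL"], "->", err.get("scrapeErrorMessage", "unknown error"))
```

Keeping the error records rather than discarding them makes it possible to retry the failed URLs in a later run.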
Example Output
[
  {
    "articleURL": "https://www.example.com/news/article1",
    "sourceURL": "https://www.example.com",
    "articleLanguage": "en",
    "articleTitle": "Example News Article",
    "articleAuthors": "John Doe, Jane Smith",
    "articlePublishDate": "2024-07-27T10:00:00Z",
    "articleText": "This is the full text of the example article...",
    "articleTopImage": "https://www.example.com/images/article1.jpg",
    "articleAllImages": "https://www.example.com/images/article1.jpg,https://www.example.com/images/article2.png",
    "articleVideos": "",
    "articleKeywords": "news, example, article",
    "articleSummary": "A brief summary of the example article.",
    "scrapedAt": "2024-07-27T12:34:56Z",
    "scrapeSuccess": true,
    "articleMetaDescription": "Meta description of the example news article.",
    "articleMetaKeywords": "example, article, news"
  }
]
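Several fields in the example arrive as comma-separated strings and the dates as ISO 8601 text, so a small normalization pass can make the records easier to work with. The sketch below is one possible approach, not something the Actor provides: the helper names (`split_csv`, `parse_iso`, `normalize`) are hypothetical, and the plain comma split assumes that no author name or URL contains a comma.

```python
from datetime import datetime
from typing import Any, Optional

# Fields documented above as comma-separated lists.
LIST_FIELDS = (
    "articleAuthors",
    "articleAllImages",
    "articleVideos",
    "articleKeywords",
    "articleMetaKeywords",
)
# Fields documented above as ISO 8601 timestamps.
DATE_FIELDS = ("articlePublishDate", "scrapedAt")


def split_csv(value: Optional[str]) -> list[str]:
    """Split a comma-separated string, dropping blanks and padding."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def parse_iso(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; return None when the field is empty."""
    if not value:
        return None
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def normalize(item: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the item with list and date fields converted."""
    out = dict(item)
    for field in LIST_FIELDS:
        out[field] = split_csv(item.get(field))
    for field in DATE_FIELDS:
        out[field] = parse_iso(item.get(field))
    return out


# A trimmed copy of the example item shown above.
example = {
    "articleURL": "https://www.example.com/news/article1",
    "articleAuthors": "John Doe, Jane Smith",
    "articlePublishDate": "2024-07-27T10:00:00Z",
    "articleAllImages": "https://www.example.com/images/article1.jpg,https://www.example.com/images/article2.png",
    "articleVideos": "",
    "articleKeywords": "news, example, article",
    "scrapedAt": "2024-07-27T12:34:56Z",
    "scrapeSuccess": True,
}

record = normalize(example)
print(record["articleAuthors"])
print(record["articlePublishDate"])
```

Fields missing from an item (for example on an error record) come back as an empty list or None instead of raising.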
Use Cases
Content Marketing & SEO
- Competitor Analysis: Track all content published by competitors
- Content Audits: Analyze an entire website's content strategy
- Keyword Research: Identify trending topics across a whole site
- Backlink Monitoring: Find sites linking to a news source
- Brand Monitoring: Track mentions of your brand across a news source's articles
Market Research & Business Intelligence
- News Aggregation: Build comprehensive news feeds from specific sources
- Trend Analysis: Identify emerging trends within a news domain
- Sentiment Analysis: Analyze the tone and sentiment of articles from a source
Academic Research
- Data Collection: Gather large datasets of articles for research
- Text Analysis: Analyze the content of entire news websites
- Niche Collections: Gather articles from sources that cover a specific niche
Other Applications
- Machine Learning: Train models with large sets of scraped articles
- Content Curation: Easily find and collect relevant articles
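As an illustration of the keyword research use case above, the following sketch counts how many successfully scraped articles mention each keyword in articleKeywords. It assumes the dataset items have been loaded as a list of dictionaries; the function name `keyword_counts` and the sample items are hypothetical.

```python
from collections import Counter
from typing import Any


def keyword_counts(items: list[dict[str, Any]]) -> Counter:
    """Count the number of successful articles that mention each keyword."""
    counts: Counter = Counter()
    for item in items:
        if not item.get("scrapeSuccess"):
            continue  # skip error records
        raw = item.get("articleKeywords") or ""
        # Use a set so a keyword repeated within one article counts once,
        # and lower-case it so "News" and "news" are merged.
        keywords = {kw.strip().lower() for kw in raw.split(",") if kw.strip()}
        counts.update(keywords)
    return counts


# Hypothetical items using the field names documented above.
sample = [
    {"scrapeSuccess": True, "articleKeywords": "news, example, article"},
    {"scrapeSuccess": True, "articleKeywords": "News, politics"},
    {"scrapeSuccess": False, "articleKeywords": "sports"},
]

counts = keyword_counts(sample)
for keyword, n in counts.most_common(3):
    print(f"{keyword}: {n}")
```

Counting articles per keyword, not raw occurrences, keeps one keyword-heavy article from dominating the ranking.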
Getting Started
- Find the "News Source Crawler" in the Apify Store
- Configure the input:
  - url: (Required) The URL of the news website to crawl
  - language: (Optional) The expected language (default: "en")
  - maxArticles: (Optional) The maximum number of articles to scrape
  - proxyConfiguration: (Optional) Select an Apify Proxy configuration or provide custom proxies
- Run the Actor
- Access results in JSON, CSV, Excel, or other formats, directly from the dataset as the Actor runs
- Optional: Schedule the Actor, set up webhooks, or integrate with other Actors
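For runs started programmatically instead of through the input form, the input described above can be assembled as a JSON object. The sketch below builds and validates such an object. The helper name `build_run_input` is hypothetical, and the `{"useApifyProxy": true}` shape for proxyConfiguration is an assumption based on the usual Apify proxy input convention, not something this listing confirms.

```python
import json
from typing import Any, Optional


def build_run_input(
    url: str,
    language: str = "en",
    max_articles: Optional[int] = None,
    use_apify_proxy: bool = False,
) -> dict[str, Any]:
    """Assemble the Actor input from the documented fields."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")

    run_input: dict[str, Any] = {"url": url, "language": language}

    # maxArticles is optional, so only include it when a limit is set.
    if max_articles is not None:
        if max_articles < 1:
            raise ValueError("max_articles must be a positive integer")
        run_input["maxArticles"] = max_articles

    if use_apify_proxy:
        # Assumption: the common Apify proxy input shape.
        run_input["proxyConfiguration"] = {"useApifyProxy": True}

    return run_input


run_input = build_run_input("https://www.example.com", max_articles=50)
print(json.dumps(run_input, indent=2))
```

The printed JSON could be pasted into the Actor's JSON input view, or the dictionary passed as the run input when starting the Actor through an Apify API client.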
Key Benefits
Data Quality
- Reliable & Accurate: Provides high-quality extracted data
- Clean Data: Extracts only the relevant information
- Structured Format: Easy to use and integrate
Platform Advantages (Apify)
- Scalable & Serverless: Handles large crawls without infrastructure management
- Cost-Effective: Pay only for what you use
- Full Apify Integration: Connects seamlessly with other Apify tools
- User-Friendly: No coding required, just a simple input form
- Real-time Results: Data is pushed to the dataset as it's scraped
- Automated Updates: The Actor is maintained and updated
- Isolated Runs: Each run is in a fresh, isolated container
Start crawling news sources today!
Actor Metrics
13 monthly users
1 bookmark
>99% runs succeeded
6.4 days response time
Created in Feb 2025
Modified 24 days ago