LLMScraper
Find best scraper for your website and data you need.
Pricing
Pay per usage
🤖 LLM-Powered Web Scraper
An intelligent Apify Actor that uses Claude AI to automatically discover, test, and select the best Apify actors for your web scraping tasks. No manual configuration needed!
✨ Features
- 🧠 AI-Powered Actor Discovery: Uses Claude AI to automatically find and test the best Apify actors for your target website
- 🔄 Smart Retry Logic: Automatically adjusts parameters and retries failed attempts with different actors
- 📊 Quality Assessment: Evaluates scraped data quality across multiple dimensions (completeness, relevance, structure, volume)
- 🎯 Priority-Based Testing: Tests domain-specific actors first, then falls back to general-purpose ones
- 📈 Real-time Progress: Tracks and reports scraping progress with detailed logging
- 🔗 MCP Integration: Connects to Apify MCP Server for dynamic actor discovery and execution
- ⚙️ Flexible Configuration: Extensive customization options for timeout, quality thresholds, and model selection
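The quality assessment feature above names four dimensions. As a minimal sketch of how such per-dimension scores might be combined into a single number, the example below uses a weighted average. The weights and the `combine_quality` function are hypothetical and are not taken from the Actor's source.

```python
# Hypothetical weights (assumption); the Actor's real weighting is not documented here.
WEIGHTS = {"completeness": 0.4, "relevance": 0.3, "structure": 0.2, "volume": 0.1}


def combine_quality(scores: dict[str, float]) -> float:
    """Return a weighted average of per-dimension scores, each in 0.0-1.0."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Round to hide floating-point noise from the summation.
    return round(total, 4)
```

A weighted average keeps the result in the same 0.0-1.0 range as the inputs, which makes it easy to compare against a single threshold.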
🚀 Quick Start
- Set up your Claude API key in the Actor input or as an environment variable
- Provide your target URL and describe what data you want to extract
- Run the Actor - it will automatically find and test the best scraping approach
Example Input
{
  "targetUrl": "https://example-ecommerce.com/products/",
  "extractionGoal": "Extract product information including title, price, description, and availability",
  "claudeApiKey": "sk-ant-api03-...",
  "maxActorAttempts": 5,
  "maxTimeMinutes": 20
}
📝 Input Configuration
Required Fields
- targetUrl: The URL of the website you want to scrape
- extractionGoal: Describe what data you want to extract from the website
- claudeApiKey: Your Anthropic Claude API key for AI-powered analysis
Optional Configuration
- maxActorAttempts (default: 10): Maximum number of different actors to try
- maxRetriesPerActor (default: 3): Maximum retry attempts per actor
- maxTimeMinutes (default: 30): Maximum total execution time in minutes
- modelName (default: "claude-3-5-haiku-latest"): Claude model to use
- debugMode (default: false): Enable detailed logging
- preferSpecificActors (default: true): Prioritize domain-specific actors
- minDataQualityScore (default: 70): Minimum quality score (0-100) to accept results
- enableProxy (default: true): Use proxy for scraping requests
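To make the defaults concrete, here is a minimal sketch of how user input could be merged with the documented default values. The `apply_defaults` helper is hypothetical. The Actor itself presumably relies on its input schema for this, so treat the code as an illustration only.

```python
# Default values copied from the optional configuration list in this README.
DEFAULTS = {
    "maxActorAttempts": 10,
    "maxRetriesPerActor": 3,
    "maxTimeMinutes": 30,
    "modelName": "claude-3-5-haiku-latest",
    "debugMode": False,
    "preferSpecificActors": True,
    "minDataQualityScore": 70,
    "enableProxy": True,
}


def apply_defaults(user_input: dict) -> dict:
    """Hypothetical helper: fill in any optional field the user left out."""
    # Later keys win, so user-supplied values override the defaults.
    return {**DEFAULTS, **user_input}
```

For example, `apply_defaults({"targetUrl": "https://example.com", "maxActorAttempts": 5})` keeps the user's `maxActorAttempts` of 5 and fills in the remaining options.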
Available Claude Models
- claude-3-5-haiku-latest: Fast & cost-effective (recommended)
- claude-3-5-sonnet-latest: Balanced performance and quality
- claude-3-opus-latest: Maximum quality (slower, more expensive)
📊 Output
The Actor saves results to:
Dataset
Each scraped item with metadata:
{
  "url": "https://example.com",
  "data": {...},
  "quality_score": 0.85,
  "actor_used": "apify/web-scraper",
  "timestamp": "2025-07-24T11:30:00Z",
  "success": true,
  "extraction_goal": "Extract product information",
  "total_execution_time": 45.2,
  "attempts_made": 3
}
Key-Value Store
Summary information in SCRAPING_RESULT:
{
  "success": true,
  "quality_score": 0.85,
  "items_count": 25,
  "best_actor_id": "apify/web-scraper",
  "total_execution_time": 45.2,
  "attempts_made": 3,
  "progress_updates": [...],
  "actor_attempts": [...]
}
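Note that the output reports `quality_score` on a 0.0-1.0 scale, while the `minDataQualityScore` input uses 0-100. The sketch below shows one way to check a summary against the input threshold. The `meets_threshold` helper is hypothetical, and the sample summary is trimmed to the fields it needs.

```python
import json


def meets_threshold(summary_json: str, min_score_percent: float) -> bool:
    """Hypothetical helper: compare a 0.0-1.0 quality_score to a 0-100 threshold."""
    summary = json.loads(summary_json)
    # Convert the percentage threshold to the 0.0-1.0 scale used in the output.
    threshold = min_score_percent / 100
    return bool(summary["success"]) and summary["quality_score"] >= threshold


sample = '{"success": true, "quality_score": 0.85, "items_count": 25}'
print(meets_threshold(sample, 70))
```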
🔧 How It Works
- Actor Discovery: Connects to Apify MCP Server to discover available actors
- AI Analysis: Uses Claude to analyze the target website and select appropriate actors
- Smart Testing: Tests actors in priority order with intelligent parameter adjustment
- Quality Evaluation: Assesses data quality using multiple metrics
- Retry Logic: Automatically retries with different parameters if needed
- Result Selection: Returns the best results based on quality scores
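The steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the Actor's actual orchestration code. `pick_best_result`, `run_actor`, and `score` are hypothetical names, and the real implementation also adjusts parameters between retries and enforces a time limit.

```python
from typing import Callable, Optional


def pick_best_result(
    actor_ids: list[str],
    run_actor: Callable[[str, int], Optional[list]],
    score: Callable[[list], float],
    min_score: float = 0.7,
    max_actor_attempts: int = 10,
    max_retries_per_actor: int = 3,
) -> Optional[dict]:
    """Try actors in priority order and keep the highest-scoring result."""
    best = None
    for actor_id in actor_ids[:max_actor_attempts]:
        for attempt in range(max_retries_per_actor):
            items = run_actor(actor_id, attempt)
            if not items:
                continue  # failed or empty run: retry this actor
            quality = score(items)
            if best is None or quality > best["quality_score"]:
                best = {"actor_id": actor_id, "items": items, "quality_score": quality}
            if quality >= min_score:
                return best  # good enough: stop early
    # No run reached the threshold; return the best one seen, or None.
    return best
```

Passing `run_actor` and `score` as callables keeps the loop testable without network access: stubs can stand in for real actor runs.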
🏗️ Architecture
The Actor consists of several key components:
- MCP Client (src/llmscraper/mcp/): Handles communication with Apify MCP Server
- Claude Manager (src/llmscraper/claude/): Manages AI conversations and tool calls
- LLM Scraper Actor (src/llmscraper/llm_scraper/): Main orchestration logic
- Retry Logic (src/llmscraper/llm_scraper/retry_logic.py): Intelligent parameter adjustment
- Quality Evaluator (src/llmscraper/llm_scraper/quality_evaluator.py): Data quality assessment
🔑 Environment Variables
- ANTHROPIC_API_KEY: Your Anthropic Claude API key (alternative to input field)
- APIFY_TOKEN: Automatically provided by Apify platform
- MCP_SERVER_URL: Custom MCP server URL (optional)
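Since the API key can come from either the input field or the environment, a lookup along these lines is one plausible approach. The `resolve_api_key` helper is hypothetical, and the precedence shown (input field first, then `ANTHROPIC_API_KEY`) is an assumption. This README does not state which source wins.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    actor_input: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: read the key from the input, else from the environment."""
    env = os.environ if env is None else env
    # Assumed precedence: an explicit input value overrides the environment variable.
    key = actor_input.get("claudeApiKey") or env.get("ANTHROPIC_API_KEY")
    if not key:
        raise ValueError("Set claudeApiKey in the input or ANTHROPIC_API_KEY in the environment")
    return key
```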
⚡ Performance Tips
- Use Haiku Model: For most tasks, claude-3-5-haiku-latest provides the best speed/cost ratio
- Adjust Attempts: Reduce maxActorAttempts for faster results, increase for better coverage
- Quality Threshold: Lower minDataQualityScore if you're getting no results
- Time Limits: Set appropriate maxTimeMinutes based on your needs
🛠️ Development
Local Testing
# Install dependencies (using virtual environment)
pip install -r requirements.txt

# Or if you have the project's virtual environment:
./venv/bin/pip install -r requirements.txt

# Set up environment
export ANTHROPIC_API_KEY=your_key_here

# Run the actor locally
python3 main.py

# Or using npm scripts:
npm run start        # Uses system python3
npm run start:local  # Uses project virtual environment
Project Structure
LLMScraper/
├── main.py                       # Actor entry point
├── src/llmscraper/
│   ├── mcp/                      # MCP client implementation
│   ├── claude/                   # Claude AI integration
│   ├── llm_scraper/              # Main scraper logic
│   │   ├── actor.py              # Main LLMScraperActor class
│   │   ├── models.py             # Input/output models
│   │   ├── retry_logic.py        # Intelligent retry logic
│   │   └── quality_evaluator.py  # Data quality assessment
│   ├── scraping/                 # Apify actor integrations
│   └── utils/                    # Configuration and utilities
├── .actor/
│   ├── actor.json                # Actor metadata
│   ├── input_schema.json         # Input validation schema
│   └── README.md                 # This file
├── Dockerfile                    # Container configuration
├── requirements.txt              # Python dependencies
├── package.json                  # Node.js metadata
└── pyproject.toml                # Python packaging configuration
📚 API Reference
Main Function
import asyncio

from llmscraper.llm_scraper import LLMScraperActor, LLMScraperInput


async def main():
    # Create configuration
    config = LLMScraperInput(
        target_url="https://example-website.com",
        extraction_goal="Extract product data",
        anthropic_api_key="sk-ant-...",
    )

    # Run the scraper
    scraper = LLMScraperActor(config)
    result = await scraper.run(progress_callback=None)
    return result


# `await` is only valid inside a coroutine, so run it through asyncio
asyncio.run(main())
Configuration
from llmscraper.llm_scraper.models import LLMScraperInput

config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-...",
    max_actor_attempts=10,
    max_retries_per_actor=3,
    max_time_minutes=30,
    model_name="claude-3-5-haiku-latest",
    debug_mode=False,
    prefer_specific_actors=True,
    min_data_quality_score=0.7,  # Note: API expects 0.0-1.0, input form uses 0-100
    enable_proxy=True,
)
📄 License
MIT License - see LICENSE file for details.
🆘 Support & Troubleshooting
Common Issues
- API Key Issues: Ensure your Claude API key is valid and has sufficient credits
- No Results Found: Try reducing minDataQualityScore or increasing maxActorAttempts
- Timeout Errors: Increase maxTimeMinutes for complex websites
- Quality Score Too Low: Adjust your extractionGoal to be more specific
Debugging
- Enable debugMode: true for detailed logging
- Check the Actor logs for step-by-step execution details
- Verify the target URL is accessible and returns content
- Monitor the progress updates in the key-value store
Performance Optimization
- Use claude-3-5-haiku-latest for faster, cost-effective processing
- Set appropriate maxActorAttempts based on your time/quality requirements
- Enable preferSpecificActors to prioritize domain-specific solutions
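To illustrate what prioritizing domain-specific actors could look like, here is a minimal sketch that moves actors whose ID mentions the target site's name to the front of the list. The matching rule, the `order_actors` helper, and the actor ID `acme/amazon-scraper` are all hypothetical. How the Actor actually decides that an actor is domain-specific is not described in this README.

```python
from urllib.parse import urlparse


def order_actors(actor_ids: list[str], target_url: str) -> list[str]:
    """Hypothetical ordering: actors mentioning the site name come first."""
    host = urlparse(target_url).hostname or ""
    parts = host.split(".")
    # Use the second-level label, e.g. "amazon" from "www.amazon.com" (assumption).
    site = parts[-2] if len(parts) >= 2 else host
    # sorted() is stable, so the original order is kept within each group.
    return sorted(actor_ids, key=lambda actor_id: site.lower() not in actor_id.lower())


candidates = ["apify/web-scraper", "acme/amazon-scraper", "apify/cheerio-scraper"]
print(order_actors(candidates, "https://www.amazon.com/dp/1"))
```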