Ai Web Scraper - Extract Data With Ease

Pay $3.00 for 1,000 results

eloquent_mountain/ai-web-scraper-extract-data-with-ease

AI Web Scraper makes scraping accessible to everyone, including non-technical users. It uses Google's Gemini LLM to scrape websites from natural language instructions: it extracts data dynamically without selector input, handles dynamic content, takes measures to avoid bot detection, and outputs JSON or other formats.

AI Web Scraper

This AI Web Scraper is a flexible tool that uses a Large Language Model (LLM), specifically Google's Gemini, to extract data from web pages. Unlike traditional scrapers that rely on predefined selectors, this actor lets you specify the data you need in natural language and adapts its extraction to each page.

Disclaimer: The goal of this scraper is to make scraping possible for everyone, including non-technical users. It is still quite experimental because it relies on certain vision capabilities, so results can sometimes be inconsistent or not entirely what you'd expect.

Important Note: A minimum of 2 URLs is required; otherwise the actor will fail. The URLs should preferably come from the same domain and page type, e.g. two product detail pages from the same website.

What Does This Actor Do?

This actor automates the process of web scraping by combining browser automation with AI-powered element identification. Here’s a breakdown of its key capabilities:

  • Dynamic Data Extraction: You briefly specify the data you need in natural language, e.g. "product name, product price", and the actor identifies and extracts those values from the web page.
  • Intelligent Element Identification: Leveraging Google's Gemini LLM, it analyzes web page screenshots to pinpoint the location of relevant elements and labels them, even if the website's structure is unfamiliar.
  • Automatic Adjustment: It scrolls automatically to capture information below the fold and normalizes DOM element coordinates so they can be matched accurately against Gemini's bounding box output.
  • Flexible Data Structure: The actor returns the data in a structured JSON format, with labels derived from the user's instructions or the bounding boxes provided by Gemini itself, making it easy to use in your own applications, reports, or spreadsheets.
  • Avoids Bot Detection: Takes several measures to avoid bot detection (using realistic user-agents and headless browser settings).

How to Use the AI Web Scraper

Using this actor is straightforward:

  1. Create an Apify Account: Start with a free Apify account using your email.
  2. Open the AI Web Scraper: Go to the actor page.
  3. Provide Instructions and URLs: Enter your instructions, e.g. "product name, product price", and at least two target URLs.
  4. Run the Actor: Click the "Start" button and wait for the data to be extracted.
  5. Download Your Data: Retrieve the scraped data in JSON format.

Important Note: A minimum of 2 URLs is required; otherwise the actor will fail. The URLs should preferably come from the same domain and page type, e.g. two product detail pages from the same website.

Input

To start scraping data, the actor accepts the following input parameters:

  • start_urls: An array of at least two URLs of the web pages you want to scrape
  • instructions: A list of items you wish to scrape e.g., "product name, product price"

Here’s an example of an input configuration in JSON format:

{
    "instructions": "Product name, Product price, SKU number, Product Dimensions",
    "start_urls": [
        "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848",
        "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036",
        "https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html"
    ]
}
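
Because a run fails with fewer than two URLs, it can help to check the input before starting the actor. The following is a minimal Python sketch of such a check. The helper name build_input is hypothetical and not part of the actor; only the two field names, instructions and start_urls, come from the input description above.

```python
import json
from urllib.parse import urlparse


def build_input(instructions: str, start_urls: list) -> dict:
    """Build an input object with the two documented fields.

    Hypothetical helper for illustration; it is not part of the actor.
    """
    urls = [u.strip() for u in start_urls if u.strip()]
    # The actor documentation states that fewer than two URLs makes a run fail.
    if len(urls) < 2:
        raise ValueError("start_urls needs at least 2 URLs")
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url}")
    if not instructions.strip():
        raise ValueError("instructions must not be empty")
    return {"instructions": instructions.strip(), "start_urls": urls}


actor_input = build_input(
    "Product name, Product price",
    [
        "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848",
        "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036",
    ],
)
print(json.dumps(actor_input, indent=4))
```

The check is deliberately strict about absolute URLs; whether the actor itself accepts other URL forms is not stated here.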

Output

The output from this Actor is stored in a dataset. You can view this data in the Apify UI or download it in JSON, CSV or other formats. The data format will vary depending on the instructions you provide. Here is the example corresponding to the input example provided above:

[
  {
    "url": "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848",
    "data": {
      "Product name": "Metal Wall Hanging Of Lord Ganesha- Divinity And Elegance",
      "Product price": "₹ 470.00 /piece",
      "SKU number": "0",
      "Product Dimensions": "0"
    }
  },
  {
    "url": "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036",
    "data": {
      "Product name": "Circular Yellow Bag With Floral Print And Elephant Design",
      "Product price": "₹ 350.00 /piece",
      "SKU number": "0",
      "Product Dimensions": "0"
    }
  },
  {
    "url": "https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html",
    "data": {
      "Product name": "1-Fase rail connector - I-vorm - Zwart",
      "Product price": "€ 1,95",
      "SKU number": "PLX849640"
    }
  }
]
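
Records can carry different labels, as the third record above shows (it has no "Product Dimensions"). If you download the JSON and want a spreadsheet-friendly table under your own control, a small Python sketch like the one below flattens the records into CSV text. The function name records_to_csv is hypothetical; the record shape (a "url" plus a "data" object) is taken from the example output.

```python
import csv
import io


def records_to_csv(records: list) -> str:
    """Flatten {"url": ..., "data": {...}} records into CSV text."""
    # Collect labels in first-seen order, because records may differ in labels.
    labels = []
    for record in records:
        for label in record.get("data", {}):
            if label not in labels:
                labels.append(label)

    buffer = io.StringIO()
    # restval fills in an empty cell when a record lacks a label.
    writer = csv.DictWriter(buffer, fieldnames=["url", *labels], restval="")
    writer.writeheader()
    for record in records:
        writer.writerow({"url": record["url"], **record.get("data", {})})
    return buffer.getvalue()


records = [
    {
        "url": "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848",
        "data": {
            "Product name": "Metal Wall Hanging Of Lord Ganesha- Divinity And Elegance",
            "Product price": "₹ 470.00 /piece",
            "SKU number": "0",
            "Product Dimensions": "0",
        },
    },
    {
        "url": "https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html",
        "data": {
            "Product name": "1-Fase rail connector - I-vorm - Zwart",
            "Product price": "€ 1,95",
            "SKU number": "PLX849640",
        },
    },
]
csv_text = records_to_csv(records)
print(csv_text)
```

Missing labels become empty cells, and values that contain commas, such as "€ 1,95", are quoted by the csv module.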

How Can I Use the Data Extracted with AI Web Scraper?

  • Market Research: Extract product information, pricing, and customer reviews for competitive analysis.
  • Content Aggregation: Collect data for news aggregation, research, or blog content.
  • Financial Analysis: Gather financial metrics and performance data from various financial websites.
  • E-commerce Intelligence: Extract and monitor product and pricing information from online stores.
  • Lead Generation: Collect relevant information for potential business opportunities.

How Does the AI Web Scraper Work?

The AI Web Scraper combines browser automation with Google's Gemini LLM. The actor works in several stages, described below.

1. Input Configuration

  • User Instructions: The scraper accepts natural language instructions describing the data to extract, such as "product name, price, and dimensions."
  • Start URLs: A list of URLs serves as the input target for scraping.

Important Note: A minimum of 2 URLs is required; otherwise the actor will fail. The URLs should preferably come from the same domain and page type, e.g. two product detail pages from the same website.

2. AI-Powered Element Detection

  • Screenshot Analysis: The scraper takes a screenshot of the web page and uses the Gemini LLM to identify bounding boxes around relevant elements. This enables adaptive data extraction without hardcoded selectors. When several URLs from the same domain are provided, only the first page is analyzed this way; subsequent URLs reuse the resulting CSS selectors, which keeps LLM costs down.
  • Bounding Box Parsing: The bounding box coordinates returned by the AI model are mapped to the DOM structure of the web page.
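
The mapping from a bounding box to a DOM element can be sketched as follows. This is an illustration under stated assumptions, not the actor's actual code: it assumes the model returns boxes as [ymin, xmin, ymax, xmax] on a 0 to 1000 grid (the convention Gemini documents for object detection), and that element rectangles are available in screenshot pixels. Rect, box_to_pixels, iou, and best_match are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self) -> float:
        # Degenerate or inverted rectangles count as empty.
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def box_to_pixels(box, width: int, height: int, scale: float = 1000.0) -> Rect:
    """Convert an assumed [ymin, xmin, ymax, xmax] box on a 0..scale grid to pixels."""
    ymin, xmin, ymax, xmax = box
    return Rect(xmin * width / scale, ymin * height / scale,
                xmax * width / scale, ymax * height / scale)


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two rectangles."""
    overlap = Rect(max(a.left, b.left), max(a.top, b.top),
                   min(a.right, b.right), min(a.bottom, b.bottom)).area
    union = a.area + b.area - overlap
    return overlap / union if union > 0 else 0.0


def best_match(box: Rect, elements: dict, min_iou: float = 0.5) -> Optional[str]:
    """Return the key of the element rectangle that overlaps the box most."""
    best_key, best_score = None, min_iou
    for key, rect in elements.items():
        score = iou(box, rect)
        if score >= best_score:
            best_key, best_score = key, score
    return best_key


# Hypothetical element rectangles for a 1280x800 screenshot.
elements = {
    "h1.title": Rect(320, 78, 960, 122),
    "span.price": Rect(320, 200, 480, 230),
}
pixel_box = box_to_pixels([100, 250, 150, 750], width=1280, height=800)
print(best_match(pixel_box, elements))
```

Intersection over union is one common way to score the overlap; the threshold of 0.5 is an arbitrary choice for the sketch.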

3. Dynamic Scrolling for Comprehensive Coverage

  • If the desired elements aren't visible in the initial viewport, the scraper scrolls through the web page in increments.
  • At each step, it captures a new screenshot and reprocesses it through Gemini to identify elements below the fold.
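
Incremental scrolling comes down to choosing a series of scroll positions and translating what is found in each screenshot back to page coordinates. The Python sketch below shows one way to do that; the function names and the optional overlap between consecutive screenshots are assumptions for illustration, not documented behavior of the actor.

```python
def scroll_offsets(page_height: int, viewport_height: int, overlap: int = 0) -> list:
    """Return the vertical scroll positions needed to cover the page top to bottom."""
    if viewport_height <= 0 or not 0 <= overlap < viewport_height:
        raise ValueError("need viewport_height > 0 and 0 <= overlap < viewport_height")
    # The last useful position shows the bottom edge of the page.
    last = max(0, page_height - viewport_height)
    step = viewport_height - overlap
    offsets = list(range(0, last, step))
    offsets.append(last)
    return offsets


def to_page_y(viewport_y: float, scroll_y: int) -> float:
    """Translate a y coordinate inside a screenshot to a page coordinate."""
    return viewport_y + scroll_y


print(scroll_offsets(2000, 800, overlap=100))
```

A small overlap reduces the chance that an element is cut in half at a screenshot boundary, at the cost of one or two extra screenshots.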

4. Selector Extraction

  • For each matched DOM element, the scraper generates a CSS selector and extracts its corresponding HTML content. This enables robust and reusable data extraction across similar pages.
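
One common way to derive a reusable CSS selector is to walk from the matched element up to the nearest ancestor with an id and describe each step by tag and position. The sketch below does this on a simplified path representation. It is an illustration only: the Node class and css_selector function are hypothetical, and the actor may build its selectors differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    tag: str
    element_id: Optional[str] = None
    nth_of_type: int = 1  # 1-based position among siblings with the same tag


def css_selector(path: list) -> str:
    """Build a selector from a root-to-target path, anchored at the nearest id."""
    parts = []
    for node in reversed(path):
        if node.element_id:
            # An id is assumed to be unique, so the walk can stop here.
            parts.append(f"#{node.element_id}")
            break
        parts.append(f"{node.tag}:nth-of-type({node.nth_of_type})")
    return " > ".join(reversed(parts))


path = [
    Node("html"),
    Node("body"),
    Node("div", element_id="product"),
    Node("div", nth_of_type=2),
    Node("span", nth_of_type=1),
]
print(css_selector(path))
```

The sketch does not escape special characters in ids, which a production version would need to handle.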

5. Data Extraction

  • The scraper retrieves text or attributes (e.g., innerHTML, text) from elements identified by the selectors, ensuring the collected data aligns precisely with user instructions.

6. Output Formatting

  • Extracted data is structured in JSON format. Each record contains the URL of the scraped page and the extracted data, making it easy to integrate into various applications.

7. Domain Optimization

  • To enhance efficiency, the scraper caches selectors for domains it has processed. Subsequent pages from the same domain reuse these selectors, reducing the need for repeated AI analysis.
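
The caching idea can be sketched as a dictionary keyed by domain. In the Python example below, SelectorCache, pages_needing_analysis, and the detect callback are hypothetical names; detect stands in for the AI analysis step and returns a mapping from label to CSS selector.

```python
from urllib.parse import urlparse


class SelectorCache:
    """Remember label-to-selector mappings per domain."""

    def __init__(self) -> None:
        self._by_domain = {}

    @staticmethod
    def domain_of(url: str) -> str:
        return (urlparse(url).hostname or "").lower()

    def get(self, url: str):
        return self._by_domain.get(self.domain_of(url))

    def put(self, url: str, selectors: dict) -> None:
        self._by_domain[self.domain_of(url)] = selectors


def pages_needing_analysis(urls: list, detect) -> list:
    """Return the URLs for which the (expensive) detection step had to run."""
    cache = SelectorCache()
    analysed = []
    for url in urls:
        if cache.get(url) is None:
            cache.put(url, detect(url))
            analysed.append(url)
        # Otherwise the cached selectors for this domain are reused.
    return analysed


urls = [
    "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848",
    "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036",
    "https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html",
]
# A stand-in detector that returns a fixed, made-up selector.
analysed = pages_needing_analysis(urls, detect=lambda url: {"Product name": "h1"})
print(len(analysed))
```

With the three example URLs from the Input section, detection runs for the first boontoon.com page and for the ledlichtdiscounter.nl page, while the second boontoon.com page reuses the cached selectors.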

Key Features:

  • AI-Driven Flexibility: Removes the need for predefined selectors by leveraging AI to dynamically identify elements.
  • Automated Scrolling: Captures data below the fold, even on long pages.
  • Bot Detection Avoidance: Implements strategies like custom user agents and headless browsing to minimize detection.
  • Structured Outputs: Outputs are clean and easy to use in JSON, CSV, or other formats.

Advantages of This Approach:

  • User-Friendly: Natural language instructions make it accessible to non-technical users.
  • Highly Adaptable: Capable of handling unknown or dynamically loaded page structures.
  • Time-Efficient: Domain-based selector caching reduces redundancy and speeds up processing.

Integrations

This Actor integrates with other Apify platform components and other external services:

  • Webhooks: Automatically notify you when the scraping is complete or send the data to another application.
  • API: Control the Actor programmatically using the Apify API.
  • Cloud Services: Use Apify integrations to automatically store the data in services like Google Sheets, Google Drive, Slack, and others.

Scrape Any Web Data You Need with This Dynamic Scraper

This AI Web Scraper is your one-stop solution for scraping any data you need. Whether it's a product name, a price, a news headline, or a financial metric, this actor adapts to extract it by analyzing the context of your instructions.

Not What You Need? Build Your Own!

If this actor doesn't exactly meet your needs, you can use one of the scraper templates available in Python, JavaScript, and TypeScript to get started or check out our open-source library Crawlee.

You can also request a custom scraping solution from us.

Your Feedback

Your feedback is valuable to us. If you have any suggestions or find a bug, please create an issue on the Actor's Issues tab in the Apify Console.

FAQ

How much does AI Web Scraper cost?

This actor uses Apify's Pay-per-result pricing model. Apify also provides you with free monthly usage credits.

How can I use AI Web Scraper with the Apify API?

You can access the Apify API programmatically via RESTful HTTP endpoints or SDKs (apify-client NPM package for JavaScript, apify-client PyPI package for Python) to run, manage, and get the data out of any actor.
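
As a rough illustration of the REST route, the Python sketch below prepares a request that starts a run, using only the standard library. It assumes the Apify API v2 "run Actor" endpoint (POST /v2/acts/{actorId}/runs), bearer-token authentication, and the actor ID from this page written with a tilde instead of a slash. The function name build_run_request is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Actor ID from this page, with "/" replaced by "~" as the API expects.
ACTOR_ID = "eloquent_mountain~ai-web-scraper-extract-data-with-ease"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that starts an actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{ACTOR_ID}/runs",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    token="YOUR_APIFY_TOKEN",  # placeholder, replace with your own token
    actor_input={
        "instructions": "Product name, Product price",
        "start_urls": [
            "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848",
            "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036",
        ],
    },
)
print(request.get_method(), request.full_url)
# To actually start the run: urllib.request.urlopen(request)
```

The apify-client packages mentioned above wrap the same API and are usually more convenient; the raw request is shown only to make the moving parts visible.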

This actor only extracts data that is publicly available. You are responsible for complying with the terms and conditions of the websites you scrape and with data privacy regulations such as GDPR.

Developer
Maintained by Community

Actor Metrics

  • 55 monthly users
  • 2 stars
  • 51% runs succeeded
  • Created in Dec 2024
  • Modified 19 days ago