SuperScraper API

apify/super-scraper-api

Generic REST API for scraping websites: send a URL and get back HTML. This Actor is a drop-in replacement for ScrapingBee, ScrapingAnt, and ScraperAPI services. And it is open-source!

SuperScraper API is an Actor that provides a REST API for scraping websites. Just pass the URL of a web page and get back the fully rendered HTML content. SuperScraper API is compatible with ScrapingBee, ScrapingAnt, and ScraperAPI interfaces.

Main features:

  • Extract HTML from arbitrary URLs with a headless browser for dynamic content rendering.
  • Circumvent blocking using datacenter or residential proxies, as well as browser fingerprinting.
  • Seamlessly scale to a large number of web pages as needed.
  • Capture screenshots of the web pages.

Note that SuperScraper API uses the new experimental Actor Standby mode, so it's not started the traditional way from Apify Console. Instead, it's invoked via the HTTP REST API provided directly by the Actor. See the examples below.

Usage examples

To run these examples, you need an Apify API token, which you can find under Settings > Integrations in Apify Console.

You can create an Apify account free of charge.

Node.js

import axios from 'axios';

const resp = await axios.get('https://super-scraper-api.apify.actor/', {
    params: {
        url: 'https://apify.com/store',
        wait_for: '.ActorStoreItem-title',
        json_response: true,
        screenshot: true,
    },
    headers: {
        Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>',
    },
});

console.log(resp.data);

curl

curl -X GET \
  'https://super-scraper-api.apify.actor/?url=https://apify.com/store&wait_for=.ActorStoreItem-title&screenshot=true&json_response=true' \
  --header 'Authorization: Bearer <YOUR_APIFY_API_TOKEN>'

Authentication

The best way to authenticate is to pass your Apify API token using the Authorization HTTP header. Alternatively, you can pass the API token via the token query parameter to authenticate the requests, which is more convenient for testing in a web browser.

Node.js

import axios from 'axios';

const resp = await axios.get('https://super-scraper-api.apify.actor/', {
    params: {
        url: 'https://apify.com/store',
        // The token travels in the query string instead of the Authorization header.
        token: '<YOUR_APIFY_API_TOKEN>',
    },
});

curl

curl -X GET 'https://super-scraper-api.apify.actor/?url=https://apify.com/store&wait_for=.ActorStoreItem-title&json_response=true&token=<YOUR_APIFY_API_TOKEN>'

Pricing

When using SuperScraper API, you're charged based on your actual usage of the Apify platform's computing, storage, and networking resources.

Cost depends on the target sites, your settings and API parameters, the load of your requests, and random network and target site conditions.

The best way to see your price is to conduct a real-world test.

Example costs on a free account (pricing is cheaper on higher plans), measured in a test of 30 one-by-one requests plus 50 batched requests:

parameters | cost estimate
no render_js + basic proxy | $1/1000 requests
no render_js + premium (residential) proxy | $2/1000 requests
render_js + basic proxy | $4/1000 requests
render_js + premium (residential) proxy | $5/1000 requests
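
As a rough illustration of how the table translates into a budget, the sketch below multiplies a request count by the matching rate. The helper name estimateCostUsd and its option names are hypothetical, and the rates are only the free-plan estimates listed above, so treat the result as a ballpark figure rather than a quote.

```javascript
// Approximate free-plan rates from the table above, in USD per 1000 requests.
// Real costs vary with the target site, the parameters used, and the load.
const RATE_PER_1000 = {
    basic: { noRender: 1, render: 4 },
    premium: { noRender: 2, render: 5 },
};

// Hypothetical helper: estimate the cost of `requests` requests.
function estimateCostUsd(requests, { renderJs = true, premiumProxy = false } = {}) {
    const proxy = premiumProxy ? 'premium' : 'basic';
    const rate = RATE_PER_1000[proxy][renderJs ? 'render' : 'noRender'];
    return (requests / 1000) * rate;
}

console.log(estimateCostUsd(2500, { renderJs: true, premiumProxy: false })); // 10
console.log(estimateCostUsd(500, { renderJs: false, premiumProxy: true })); // 1
```

A real-world test, as recommended above, remains the more reliable way to find the actual price.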

API parameters

ScrapingBee API parameters

SuperScraper API supports most of the API parameters of ScrapingBee:

parameter | description
url | URL of the webpage to be scraped. This parameter is required.
json_response | Return a verbose JSON response with additional details about the webpage. Can be either true or false, default is false.
extract_rules | A stringified JSON containing custom rules for extracting data from the webpage.
render_js | Indicates that the webpage should be scraped using a headless browser, with dynamic content rendered. Can be true or false, default is true. This is equivalent to ScrapingAnt's browser.
screenshot | Get a screenshot of the browser's current viewport. If json_response is set to true, the screenshot will be returned in Base64 encoding. Can be true or false, default is false.
screenshot_full_page | Get a screenshot of the full page. If json_response is set to true, the screenshot will be returned in Base64 encoding. Can be true or false, default is false.
screenshot_selector | Get a screenshot of the element specified by the selector. If json_response is set to true, the screenshot will be returned in Base64 encoding. Must be a non-empty string.
js_scenario | JavaScript instructions that will be executed after loading the webpage.
wait | Specify a duration that the browser will wait after loading the page, in milliseconds.
wait_for | Specify a CSS selector of an element for which the browser will wait after loading the page.
wait_browser | Specify a browser event to wait for. Can be either load, domcontentloaded, or networkidle.
block_resources | Specify that you want to block images and CSS. Can be true or false, default is true.
window_width | Specify the width of the browser's viewport, in pixels.
window_height | Specify the height of the browser's viewport, in pixels.
cookies | Custom cookies to use when fetching the web pages. This is useful for fetching webpages behind a login. The cookies must be specified in a string format: cookie_name_1=cookie_value1;cookie_name_2=cookie_value_2.
own_proxy | A custom proxy to be used for scraping, in the format <protocol><username>:<password>@<host>:<port>.
premium_proxy | Use residential proxies to fetch the web content, in order to reduce the probability of being blocked. Can be either true or false, default is false.
stealth_proxy | Works the same as premium_proxy.
country_code | Use IP addresses that are geolocated in the specified country by specifying its 2-letter ISO code. When using a code other than US, premium_proxy must be set to true. This is equivalent to setting ScrapingAnt's proxy_country.
custom_google | Use this option if you want to scrape Google-related websites (such as Google Search or Google Shopping). Can be true or false, default is false.
return_page_source | Return the HTML of the webpage from the response before any dynamic JavaScript rendering. Can be true or false, default is false.
transparent_status_code | By default, if the target webpage responds with an HTTP status code other than 200-299 or 404, the API will return HTTP status code 500. Set this parameter to true to disable this behavior and return the status code of the actual response.
timeout | Set the maximum timeout for the response from this Actor, in milliseconds. The default is 140 000 ms.
forward_headers | If set to true, HTTP headers starting with the prefix Spb- or Ant- will be forwarded to the target webpage alongside headers generated by us (the prefix will be trimmed).
forward_headers_pure | If set to true, only headers starting with the prefix Spb- or Ant- will be forwarded to the target webpage (the prefix will be trimmed), without any other HTTP headers from our side.
device | Can be either desktop (default) or mobile.

ScrapingBee's API parameters block_ads and session_id are currently not supported.
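
To make a few of the parameters above concrete, here is a minimal sketch that assembles a request URL using only built-in Node.js globals. The helper names formatCookies and buildRequestUrl are hypothetical, and the cookie names, country code, and timeout are example values, not recommendations.

```javascript
// Base URL of the Actor, as used in the usage examples.
const BASE_URL = 'https://super-scraper-api.apify.actor/';

// Hypothetical helper: serialize cookies into the documented
// "name_1=value_1;name_2=value_2" string format.
function formatCookies(cookies) {
    return Object.entries(cookies)
        .map(([name, value]) => `${name}=${value}`)
        .join(';');
}

// Hypothetical helper: build a request URL from ScrapingBee-style parameters.
// URLSearchParams takes care of percent-encoding the values.
function buildRequestUrl(params) {
    const url = new URL(BASE_URL);
    for (const [key, value] of Object.entries(params)) {
        url.searchParams.set(key, String(value));
    }
    return url.toString();
}

const requestUrl = buildRequestUrl({
    url: 'https://apify.com/store',
    render_js: true,
    wait_for: '.ActorStoreItem-title',
    country_code: 'de',
    premium_proxy: true, // required for country codes other than US
    cookies: formatCookies({ session: 'abc123', lang: 'en' }),
    timeout: 60000, // milliseconds
});

console.log(requestUrl);
```

The resulting URL can be passed to any HTTP client together with the Authorization header shown in the usage examples.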

ScrapingAnt API parameters

SuperScraper API supports most of the API parameters of ScrapingAnt:

parameter | description
url | URL of the webpage to be scraped. This parameter is required.
browser | Indicates that the webpage should be scraped using a headless browser, with dynamic content rendered. Can be true or false, default is true. This is equivalent to ScrapingBee's render_js.
cookies | Use custom cookies, which must be in a string format: cookie_name_1=cookie_value1;cookie_name_2=cookie_value_2.
js_snippet | Base64-encoded JavaScript code to be executed on the webpage. It will be treated as the evaluate instruction.
proxy_type | Specify the type of proxies, which can be either datacenter (default) or residential. Setting residential is equivalent to setting ScrapingBee's premium_proxy or stealth_proxy to true.
wait_for_selector | Specify a CSS selector of an element for which the browser will wait after loading the page. This is equivalent to setting ScrapingBee's wait_for.
block_resource | Specify one or more resource types you want to block from being downloaded. The parameter can be repeated in the URL (e.g. block_resource=image&block_resource=media). Available options are: document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other.
return_page_source | Return the HTML of the webpage from the response before any dynamic JavaScript rendering. Can be true or false, default is false.
proxy_country | Use IP addresses that are geolocated in the specified country by specifying its 2-letter ISO code. When using a code other than US, premium_proxy must be set to true. This is equivalent to setting ScrapingBee's country_code.

ScrapingAnt's API parameter x-api-key is not supported.

Note that HTTP headers in a request to this Actor beginning with prefix Ant- will be forwarded (without the prefix) to the target webpage alongside headers generated by the Actor. This behavior can be changed using ScrapingBee's forward_headers or forward_headers_pure parameters.
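
The ScrapingAnt-style conventions described above (a Base64-encoded js_snippet, a repeatable block_resource parameter, and Ant- prefixed headers) can be sketched as follows. The function name buildScrapingAntStyleRequest and its option names are hypothetical; the sketch only prepares the URL and headers and does not send anything.

```javascript
// Hypothetical helper: prepare the URL and headers for a ScrapingAnt-style call.
function buildScrapingAntStyleRequest({ targetUrl, snippet, blockedTypes = [], targetHeaders = {} }) {
    const url = new URL('https://super-scraper-api.apify.actor/');
    url.searchParams.set('url', targetUrl);

    if (snippet) {
        // js_snippet must be Base64-encoded JavaScript.
        url.searchParams.set('js_snippet', Buffer.from(snippet, 'utf8').toString('base64'));
    }

    // block_resource may be repeated, once per resource type.
    for (const type of blockedTypes) {
        url.searchParams.append('block_resource', type);
    }

    // Headers prefixed with "Ant-" are forwarded to the target without the prefix.
    const headers = { Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>' };
    for (const [name, value] of Object.entries(targetHeaders)) {
        headers[`Ant-${name}`] = value;
    }

    return { url: url.toString(), headers };
}

const request = buildScrapingAntStyleRequest({
    targetUrl: 'https://www.example.com',
    snippet: 'document.title',
    blockedTypes: ['image', 'media'],
    targetHeaders: { 'Accept-Language': 'de-DE' },
});

console.log(request.url);
console.log(request.headers);
```

Using append rather than set for block_resource is what keeps both values in the query string.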

ScraperAPI API parameters

SuperScraper API supports most of the API parameters of ScraperAPI:

parameter | description
url | URL of the webpage to be scraped. This parameter is required.
render | Specify whether to scrape the webpage using a headless browser. Can be true or false, default is true. This is equivalent to ScrapingBee's render_js.
wait_for_selector | Specify a CSS selector of an element for which the browser will wait after loading the page. This is equivalent to setting ScrapingBee's wait_for.
premium | Use residential proxies to fetch the web content, in order to reduce the probability of being blocked. Can be either true or false, default is false. This is equivalent to setting ScrapingBee's premium_proxy.
ultra_premium | Same as premium.
country_code | Use IP addresses that are geolocated in the specified country by specifying its 2-letter ISO code. When using a code other than US, premium_proxy must be set to true. This is equivalent to setting ScrapingAnt's proxy_country.
keep_headers | If true, all headers sent to this Actor will be forwarded to the target website. The Authorization header will be removed.
device_type | Can be either desktop (default) or mobile. This is equivalent to setting ScrapingBee's device.
binary_target | Specify whether the target is a file. Can be true or false, default is false. Currently only supported when JS rendering is set to false via the render_js, browser, or render parameters.

ScraperAPI's API parameters session_number and autoparse are currently not supported, and they are ignored.
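
As a sketch of the binary_target constraint from the table above, the snippet below builds a request with JS rendering switched off and downloads the result. The helper names are hypothetical, the global fetch assumes Node.js 18 or newer, and treating the response body as the raw file bytes is an assumption that the documentation above does not spell out.

```javascript
// Hypothetical helper: binary_target is only supported without JS rendering,
// so render is explicitly set to false.
function buildFileDownloadUrl(fileUrl) {
    const url = new URL('https://super-scraper-api.apify.actor/');
    url.searchParams.set('url', fileUrl);
    url.searchParams.set('render', 'false');
    url.searchParams.set('binary_target', 'true');
    return url.toString();
}

// Hypothetical helper: download the file into a Buffer.
// Assumes the response body carries the raw file bytes.
async function downloadFile(fileUrl, apifyToken) {
    const response = await fetch(buildFileDownloadUrl(fileUrl), {
        headers: { Authorization: `Bearer ${apifyToken}` },
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return Buffer.from(await response.arrayBuffer());
}

// The file URL is a placeholder.
console.log(buildFileDownloadUrl('https://www.example.com/report.pdf'));
```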

Custom extraction rules

Using ScrapingBee's extract_rules parameter, you can specify a set of rules to extract specific data from the target web pages. You can create an extraction rule in one of two ways: with shortened options, or with full options.

Shortened options

  • the value for the given key serves as a selector
  • using @, you can access an attribute of the selected element

Example:

{
    "title": "h1",
    "link": "a@href"
}

Full options

  • selector is required
  • type can be either item (default) or list
  • output indicates what the result for the matched element(s) will look like. It can be:
    • text (default option when output is omitted) - text of the element
    • html - HTML of the element
    • attribute name (starts with @, for example @href)
    • object with other extract rules for the given item (key + shortened or full options)
    • table_json or table_array to scrape a table in a JSON or array format
  • clean - relevant when output is text; specifies whether whitespace should be trimmed from the element's text (can be true or false, default true)

Example:

{
    "custom key for links": {
        "selector": "a",
        "type": "list",
        "output": {
            "linkName": {
                "selector": "a",
                "clean": false
            },
            "href": {
                "selector": "a",
                "output": "@href"
            }
        }
    }
}
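
One way to read the relationship between the two forms is that a shortened rule such as "a@href" corresponds to a full rule with selector "a" and output "@href". The sketch below expresses that reading; expandRule is a hypothetical name, this is not the Actor's own parsing code, and the naive split on the last @ would misread selectors that themselves contain an @ character.

```javascript
// Hypothetical helper: expand a shortened rule into the full-options form.
// Illustrative only; the Actor may parse shortened rules differently.
function expandRule(rule) {
    if (typeof rule !== 'string') {
        return rule; // already written with full options
    }
    const at = rule.lastIndexOf('@');
    if (at <= 0) {
        return { selector: rule }; // no attribute part, output defaults to text
    }
    return {
        selector: rule.slice(0, at),
        output: `@${rule.slice(at + 1)}`,
    };
}

console.log(expandRule('h1')); // { selector: 'h1' }
console.log(expandRule('a@href')); // { selector: 'a', output: '@href' }
```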

Example

This example extracts all links from the Apify Blog along with their titles.

import axios from 'axios';

const extractRules = {
    title: 'h1',
    allLinks: {
        selector: 'a',
        type: 'list',
        output: {
            title: 'a',
            link: 'a@href',
        },
    },
};

const resp = await axios.get('https://super-scraper-api.apify.actor/', {
    params: {
        url: 'https://blog.apify.com/',
        // The rules must be passed as a stringified JSON.
        extract_rules: JSON.stringify(extractRules),
        // verbose: true,
    },
    headers: {
        Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>',
    },
});

console.log(resp.data);

The results look like this:

{
  "title": "Apify Blog",
  "allLinks": [
    {
      "title": "Data for generative AI & LLM",
      "link": "https://apify.com/data-for-generative-ai"
    },
    {
      "title": "Product matching AI",
      "link": "https://apify.com/product-matching-ai"
    },
    {
      "title": "Universal web scrapers",
      "link": "https://apify.com/store/scrapers/universal-web-scrapers"
    }
  ]
}

Custom JavaScript code

Use ScrapingBee's js_scenario parameter to specify instructions that will be executed in order, one by one, after the page is opened.

Set json_response to true to get a full report of the executed instructions. The results of evaluate instructions will be added to the evaluate_results field.

Example of clicking a button:

import axios from 'axios';

const instructions = {
    instructions: [
        { click: '#button' },
    ],
};

const resp = await axios.get('https://super-scraper-api.apify.actor/', {
    params: {
        url: 'https://www.example.com',
        // The scenario must be passed as a stringified JSON.
        js_scenario: JSON.stringify(instructions),
    },
    headers: {
        Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>',
    },
});

console.log(resp.data);

Strict mode

If one instruction fails, then the subsequent instructions will not be executed. To disable this behavior, you can optionally set strict to false (by default it's true):

{
    "instructions": [
        { "click": "#button1" },
        { "click": "#button2" }
    ],
    "strict": false
}
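
The effect of the strict flag can be modeled locally with a small loop. This is an illustrative model of the behavior described above, not the Actor's implementation; runScenario and the execute callback are hypothetical names, and the callback simply throws to simulate a failing instruction.

```javascript
// Illustrative model only: run instructions in order and record the outcome.
// `execute` is a hypothetical callback that throws when an instruction fails.
function runScenario({ instructions, strict = true }, execute) {
    const report = [];
    for (const instruction of instructions) {
        try {
            execute(instruction);
            report.push({ instruction, success: true });
        } catch {
            report.push({ instruction, success: false });
            if (strict) break; // strict mode: stop at the first failure
        }
    }
    return report;
}

// Simulate a page where #button1 does not exist.
const failOnButton1 = (instruction) => {
    if (instruction.click === '#button1') throw new Error('element not found');
};
const steps = [{ click: '#button1' }, { click: '#button2' }];

console.log(runScenario({ instructions: steps, strict: true }, failOnButton1).length); // 1
console.log(runScenario({ instructions: steps, strict: false }, failOnButton1).length); // 2
```

With strict left at its default, the second click is never attempted; with strict set to false, it runs even though the first one failed.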

Supported instructions

wait
  • wait for a specified time, in milliseconds
  • example: {"wait": 10000}
wait_for
  • wait for an element specified by the selector
  • example: {"wait_for": "#element"}
click
  • click on an element specified by the selector
  • example: {"click": "#button"}
wait_for_and_click
  • combination of the previous two
  • example: {"wait_for_and_click": "#button"}
scroll_x and scroll_y
  • scroll a specified number of pixels horizontally or vertically
  • example: {"scroll_y": 1000} or {"scroll_x": 1000}
fill
  • specify a selector of the input element and the value you want to fill in
  • example: {"fill": ["input_1", "value_1"]}
evaluate
  • evaluate custom JavaScript on the webpage
  • text/number/object results will be saved in the evaluate_results field
  • example: {"evaluate":"document.querySelectorAll('a').length"}
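
Several of these instructions can be combined into a single scenario. The sketch below is one possible combination, with hypothetical selectors (#search, #submit) and an example target URL; it only builds the request URL, with json_response enabled so that the evaluate result is reported.

```javascript
// A scenario that fills in a search form, submits it, scrolls, and counts links.
// The selectors are hypothetical and must match the actual target page.
const scenario = {
    instructions: [
        { wait_for: '#search' },
        { fill: ['#search', 'web scraping'] },
        { click: '#submit' },
        { wait: 2000 },
        { scroll_y: 1000 },
        { evaluate: "document.querySelectorAll('a').length" },
    ],
    strict: true,
};

const scenarioUrl = new URL('https://super-scraper-api.apify.actor/');
scenarioUrl.searchParams.set('url', 'https://www.example.com');
scenarioUrl.searchParams.set('js_scenario', JSON.stringify(scenario));
scenarioUrl.searchParams.set('json_response', 'true');

console.log(scenarioUrl.toString());
```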

Developer
Maintained by Apify

Created in May 2024