import path from 'node:path';
import { Readable } from 'node:stream';

import { Actor, log } from 'apify';
import * as cheerio from 'cheerio';
import got from 'got';
// hpagent provides the proxy agents that got needs to route requests through a proxy URL.
import { HttpProxyAgent, HttpsProxyAgent } from 'hpagent';
import guard from 'robots-txt-guard';
import parseRobots from 'robots-txt-parse';
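
/**
 * Image URL Scraper
 *
 * Starts from `startUrl`, follows links up to `maxCrawlDepth` (optionally limited to the
 * hostnames listed in `scope`), collects image URLs from <img src> attributes, inline
 * background-image styles and direct links to image files, and pushes the results to the
 * default dataset. robots.txt is honoured unless `respectRobotsTxt` is disabled.
 */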
await Actor.init();

const input = (await Actor.getInput()) ?? {};
const proxyConfiguration = await Actor.createProxyConfiguration(input.proxyConfiguration);
// newUrl() is async; createProxyConfiguration() returns undefined when no proxy is configured.
const proxyUrl = proxyConfiguration ? await proxyConfiguration.newUrl() : undefined;
// got has no built-in proxy support, so build http(s) agents from the proxy URL
// (this assumes the hpagent package is installed).
const proxyAgent = proxyUrl
    ? { http: new HttpProxyAgent({ proxy: proxyUrl }), https: new HttpsProxyAgent({ proxy: proxyUrl }) }
    : undefined;

if (proxyUrl) log.info(`Using proxy: ${proxyUrl}`);

const {
    startUrl,
    maxCrawlDepth = 1,
    maxConcurrency = 10,
    imageExtensions = ['jpg', 'jpeg', 'png', 'gif', 'webp', 'bmp', 'svg'],
    respectRobotsTxt = true,
    userAgent = 'Mozilla/5.0 (compatible; ApifyBot/1.0; +https://apify.com/bot)',
    useScope = false,
    scope = [],
    includeSubdomains = false,
} = input;

if (!startUrl) throw new Error('startUrl is required!');

log.info('Starting Image URL Scraper', { startUrl, maxCrawlDepth, maxConcurrency });
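
// Illustrative input (values are placeholders, not defaults):
// {
//     "startUrl": "https://example.com",
//     "maxCrawlDepth": 2,
//     "maxConcurrency": 5,
//     "respectRobotsTxt": true,
//     "useScope": true,
//     "scope": ["example.com"],
//     "includeSubdomains": true,
//     "proxyConfiguration": { "useApifyProxy": true }
// }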

const robotsGuards = new Map();

// Fetches and parses robots.txt once per hostname; a cached value of null means "allow everything".
async function getGuardForUrl(url) {
    const { hostname } = new URL(url);
    if (!robotsGuards.has(hostname)) {
        if (!respectRobotsTxt) {
            robotsGuards.set(hostname, null);
        } else {
            try {
                const robotsUrl = new URL('/robots.txt', url).href;
                const robotsTxtText = await got(robotsUrl, {
                    timeout: { request: 5000 },
                    headers: { 'User-Agent': userAgent },
                    agent: proxyAgent,
                }).text();

                // robots-txt-parse is a streaming parser and returns a promise.
                const parsedRobots = await parseRobots(Readable.from(robotsTxtText));
                const guardInstance = guard(parsedRobots);

                robotsGuards.set(hostname, guardInstance);
            } catch (err) {
                log.warning(`Could not load robots.txt for ${hostname}, allowing all by default. Error: ${err}`);
                robotsGuards.set(hostname, null);
            }
        }
    }
    return robotsGuards.get(hostname);
}
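
// Example (illustrative): for a robots.txt containing "User-agent: *" / "Disallow: /private",
// a guard built from it behaves roughly like:
//   guardInstance.isAllowed('*', '/private/page.html') === false
//   guardInstance.isAllowed('*', '/public/page.html')  === true
// robots-txt-guard matches on agent name and path, which is why crawlPage below passes
// the configured userAgent and the URL's pathname rather than the full URL.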

const requestQueue = await Actor.openRequestQueue();
await requestQueue.addRequest({ url: startUrl, userData: { depth: 0 } });

const processedUrls = new Set();
const foundImages = [];
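
// Downloads one page, records every image URL found on it and, while below
// maxCrawlDepth, enqueues the page's outgoing links for further crawling.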
const crawlPage = async (request) => {
    const { url, userData: { depth } } = request;

    const guardian = await getGuardForUrl(url);
    // robots-txt-guard matches on agent name and path, not on a full URL.
    if (guardian && !guardian.isAllowed(userAgent, new URL(url).pathname)) {
        log.warning(`Blocked by robots.txt: ${url}`);
        return;
    }

    if (processedUrls.has(url)) {
        log.info(`Already processed ${url}, skipping.`);
        return;
    }
    processedUrls.add(url);

    log.info(`Processing ${url} (depth: ${depth})`);

    let body;
    try {
        const response = await got(url, {
            timeout: { request: 10000 },
            headers: { 'User-Agent': userAgent },
            agent: proxyAgent,
        });
        const contentType = response.headers['content-type'] || '';
        if (!contentType.includes('text/html')) {
            log.info(`Skipping non-HTML content at ${url}`);
            return;
        }
        body = response.body;
    } catch (error) {
        log.error(`Failed to download ${url}: ${error.message}`);
        return;
    }

    const $ = cheerio.load(body);
    const imagesOnPage = [];

    // <img src="..."> elements.
    $('img[src]').each((_, el) => {
        const src = $(el).attr('src');
        if (src) imagesOnPage.push(src);
    });

    // Inline background-image styles (a style attribute may contain several).
    $('[style]').each((_, el) => {
        const style = $(el).attr('style') || '';
        for (const match of style.matchAll(/background-image:\s*url\(["']?([^"')]+)["']?\)/gi)) {
            imagesOnPage.push(match[1]);
        }
    });

    // Resolve to absolute URLs and keep only those whose pathname has a known image
    // extension (using the pathname so query strings do not break the check).
    const filteredImages = imagesOnPage
        .map((src) => {
            try {
                return new URL(src, url).href;
            } catch {
                return null;
            }
        })
        .filter((src) => src && imageExtensions.includes(path.extname(new URL(src).pathname).substring(1).toLowerCase()));

    const uniqueImages = [...new Set(filteredImages)];

    log.info(`Found ${uniqueImages.length} images on ${url}`);

    for (const imgUrl of uniqueImages) {
        foundImages.push({
            url: imgUrl,
            sourcePage: url,
            detectedAt: new Date().toISOString(),
        });
    }

    if (depth < maxCrawlDepth) {
        const links = [];
        $('a[href]').each((_, el) => {
            const href = $(el).attr('href');
            if (!href) return;
            try {
                const absoluteUrl = new URL(href, url).href;
                links.push(absoluteUrl);
                // Links that point directly at an image file are recorded as results too.
                const ext = path.extname(new URL(absoluteUrl).pathname).substring(1).toLowerCase();
                if (imageExtensions.includes(ext)) {
                    foundImages.push({
                        url: absoluteUrl,
                        sourcePage: url,
                        detectedAt: new Date().toISOString(),
                    });
                }
            } catch {
                // Ignore hrefs that cannot be resolved to valid URLs.
            }
        });

        const uniqueLinks = [...new Set(links)];

        // When scoping is enabled, only hostnames listed in `scope` (optionally including
        // their subdomains) are followed.
        function isUrlAllowedByScope(targetUrl) {
            if (!useScope) return true;
            try {
                const { hostname } = new URL(targetUrl);
                return scope.some((domain) => {
                    if (includeSubdomains) {
                        return hostname === domain || hostname.endsWith(`.${domain}`);
                    }
                    return hostname === domain;
                });
            } catch {
                return false;
            }
        }
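
        // Example (illustrative): with scope = ['example.com'] and includeSubdomains = true,
        // 'https://blog.example.com/post' would be followed while 'https://example.org/' would not.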

        let enqueuedCount = 0;
        for (const link of uniqueLinks) {
            if (!processedUrls.has(link) && isUrlAllowedByScope(link)) {
                await requestQueue.addRequest({
                    url: link,
                    userData: { depth: depth + 1 },
                });
                enqueuedCount++;
            }
        }

        log.info(`Enqueued ${enqueuedCount} of ${uniqueLinks.length} links from ${url}`);
    }
};
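
// A simple worker pool: spin up `concurrency` loops that pull requests from the shared
// queue until it reports itself finished; failed requests are retried a few times.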
const concurrency = Math.min(maxConcurrency, 20);
const promises = [];

for (let i = 0; i < concurrency; i++) {
    promises.push((async () => {
        // Keep polling until the queue is finished: fetchNextRequest() can return null
        // while other workers are still adding new requests.
        while (!(await requestQueue.isFinished())) {
            const request = await requestQueue.fetchNextRequest();
            if (!request) {
                await new Promise((resolve) => setTimeout(resolve, 500));
                continue;
            }

            try {
                await crawlPage(request);
                await requestQueue.markRequestHandled(request);
            } catch (err) {
                log.error(`Error crawling ${request.url}: ${err.message}`);
                // Reclaim the request so it is retried, but give up after a few attempts.
                request.retryCount += 1;
                if (request.retryCount < 3) {
                    await requestQueue.reclaimRequest(request);
                } else {
                    await requestQueue.markRequestHandled(request);
                }
            }
        }
    })());
}

await Promise.all(promises);

// pushData() accepts an array, and the collected records already have the final shape
// ({ url, sourcePage, detectedAt }), so push them in one call.
if (foundImages.length > 0) {
    await Actor.pushData(foundImages);
}

log.info(`Crawl finished. Collected ${foundImages.length} image URLs.`);
await Actor.exit();