BeautifulSoup Scraper
Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.
Max crawling depth
maxCrawlingDepth
integer, optional
Specifies how many links away from the Start URLs the scraper will descend. Note that pages added using context.request_queue in the Page function are not subject to the maximum depth constraint.
Default value of this property is 1
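The depth limit can be pictured with a small breadth-first sketch over an in-memory link graph. This is only an illustration of the "links away from the Start URLs" idea. The function name crawl_order, the links mapping, and the traversal order are assumptions made for the example, not the Actor's actual implementation.

```python
from collections import deque


def crawl_order(start_urls, links, max_crawling_depth):
    """Return (url, depth) pairs in the order a breadth-first crawl visits them.

    `links` is a hypothetical mapping of each URL to the URLs it links to.
    """
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        # Pages at the maximum depth are processed, but their links are not followed.
        if depth >= max_crawling_depth:
            continue
        for next_url in links.get(url, []):
            if next_url not in seen:
                seen.add(next_url)
                queue.append((next_url, depth + 1))
    return visited


# Hypothetical site: a -> b, c ; b -> d ; d -> e
site = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
depth_one = crawl_order(["a"], site, 1)
```

With a depth of 1, the sketch visits the start page and the pages it links to directly, and nothing further.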
Request timeout
requestTimeout
integer, optional
The maximum duration (in seconds) for the request to complete before timing out. The timeout value is passed to the httpx.AsyncClient object.
Default value of this property is 10
Link selector
linkSelector
string, optional
A CSS selector stating which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Link patterns field.
If the Link selector is empty, the page links are ignored. You can also work with the page links and the request queue in the Page function.
Link patterns
linkPatterns
array, optional
Link patterns (regular expressions) that a link on the page must match to be enqueued. Combine with the Link selector to tell the scraper where to find links. If the link patterns are omitted, the scraper enqueues all links matched by the Link selector.
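The interaction between the two fields can be sketched as a filter over the links the selector already found. This is an illustration only: the function name filter_links is hypothetical, and the use of re.search (a match anywhere in the URL) is an assumption, since the text above does not say how the Actor applies the patterns.

```python
import re


def filter_links(hrefs, link_patterns):
    """Keep the links that match at least one pattern.

    An empty pattern list keeps every link, as described above.
    """
    if not link_patterns:
        return list(hrefs)
    compiled = [re.compile(pattern) for pattern in link_patterns]
    # Assumption: a pattern may match anywhere in the URL (re.search).
    return [href for href in hrefs if any(rx.search(href) for rx in compiled)]


# Hypothetical links already selected by a Link selector such as "a[href]"
found = [
    "https://example.com/blog/post-1",
    "https://example.com/about",
    "https://example.com/blog/post-2",
]
blog_only = filter_links(found, [r"/blog/"])
```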
Page function
pageFunction
string, required
A Python function that is executed for every page. Use it to scrape data from the page, perform actions, or add new URLs to the request queue. The page function has its own naming scope, and you can import any installed modules. Typically you would obtain the data from the context.soup object and return it. The identifier page_function can't be changed. For more information about the context object passed to page_function, see github.com/apify/actor-beautifulsoup-scraper#context. Asynchronous functions are supported.
BeautifulSoup features
soupFeatures
string, optional
The value of the BeautifulSoup features argument. From the BeautifulSoup docs: Desirable features of the parser to be used. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or it may be the type of markup to be used ("html", "html5", "xml"). It's recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.
BeautifulSoup from_encoding
soupFromEncoding
string, optional
The value of the BeautifulSoup from_encoding argument. From the BeautifulSoup docs: A string indicating the encoding of the document to be parsed. Pass this in if Beautiful Soup is guessing wrongly about the document's encoding.
BeautifulSoup exclude_encodings
soupExcludeEncodings
array, optional
The value of the BeautifulSoup exclude_encodings argument. From the BeautifulSoup docs: A list of strings indicating encodings known to be wrong. Pass this in if you don't know the document's encoding but you know Beautiful Soup's guess is wrong.
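The three BeautifulSoup fields above correspond to arguments of the BeautifulSoup constructor. The sketch below shows them on a small byte string encoded as ISO-8859-1. The parse helper and the sample markup are made up for the example, and how the Actor forwards the input values is assumed rather than confirmed.

```python
from bs4 import BeautifulSoup

# "<p>café</p>" encoded as ISO-8859-1, so the bytes are not valid UTF-8.
raw_html = "<p>café</p>".encode("iso-8859-1")


def parse(markup, features="html.parser", from_encoding=None, exclude_encodings=None):
    """Hypothetical helper that forwards the three input fields to BeautifulSoup."""
    return BeautifulSoup(
        markup,
        features=features,
        from_encoding=from_encoding,
        exclude_encodings=exclude_encodings,
    )


# State the encoding explicitly when it is known.
soup = parse(raw_html, from_encoding="iso-8859-1")
text = soup.p.get_text()

# Rule out a known-wrong guess instead; the encoding Beautiful Soup then
# settles on depends on which detection libraries are installed.
soup_guess = parse(raw_html, exclude_encodings=["utf-8"])
```

Naming "html.parser" explicitly follows the recommendation quoted above to pick a specific parser.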
Actor Metrics
23 monthly users
4 stars
95% runs succeeded
Created in Jul 2023
Modified 2 months ago