
Bluesky Jetstream Scraper
Pricing
$25.00/month + usage

Bluesky Social Feed Scraper collects posts from Bluesky's Jetstream API. Filter by hashtags, usernames, or languages to gather targeted data. Includes media attachments, user profiles, and reply context. Perfect for social research, trend analysis, and content monitoring on the platform.
🌊 Bluesky Jetstream Scraper
The Bluesky Jetstream Scraper is a tool built for Apify to collect and analyze real-time data from the Bluesky social network using the ATProto Firehose (Jetstream). This scraper allows you to filter posts by various criteria and customize the output format.
🔄 Jetstream vs. Crawling: This scraper uses Bluesky's Jetstream (firehose) API, which delivers a continuous stream of real-time data directly from Bluesky's servers. Traditional crawling issues many API requests to gather posts, which runs into rate limits and uses more resources. Jetstream instead delivers the full stream of content as it is created, without polling individual endpoints. This makes it well suited to large-scale data collection, trend analysis, and real-time monitoring.
⚠️ Real-Time vs. Historical: Jetstream is designed for collecting current, real-time data only. It captures content as it is posted and cannot retrieve posts from the past, so it is not suitable for historical data collection or for analyzing posts over extended periods of time. If you need content from specific past time periods, use a different method, such as the Bluesky Query API (with appropriate rate limiting).
📣 Platform Notice: Bluesky and its API infrastructure are still evolving. API specifications, data formats, and endpoints may change over time. We strive to keep this scraper up to date with platform changes, but occasional updates may be needed to maintain compatibility as the Bluesky ecosystem continues to develop.
📋 Input Schema Parameters
This section describes in detail how each input parameter affects the behavior of the scraper and the resulting output.
🔍 Filtering Parameters
hashtags
- Type: Array of strings
- Description: A list of hashtags to filter posts by (without the # symbol)
- Behavior: The scraper will only collect posts that contain at least one of the specified hashtags. When multiple hashtags are provided, posts matching ANY of these hashtags will be included (OR logic).
- Example: If you set ["apify", "scraping"], the output will include all posts containing either #apify OR #scraping.
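As a rough illustration of the OR logic, the sketch below pulls hashtags out of a post record and checks them against a wanted list. It assumes hashtags are carried as rich-text facets of type app.bsky.richtext.facet#tag, as in the public Bluesky post schema. The helper names (extract_hashtags, has_any_hashtag) and the case-insensitive comparison are assumptions for this example, not the scraper's actual code.

```python
# Facet feature type used for hashtags in Bluesky post records.
TAG_FEATURE = "app.bsky.richtext.facet#tag"


def extract_hashtags(record):
    """Return lowercase hashtags found in a post record's facets."""
    tags = []
    for facet in record.get("facets", []):
        for feature in facet.get("features", []):
            if feature.get("$type") == TAG_FEATURE and "tag" in feature:
                tags.append(feature["tag"].lower())
    return tags


def has_any_hashtag(record, wanted):
    """OR logic: True if the post carries at least one wanted hashtag."""
    # Assumption: matching ignores case and a leading '#'.
    wanted_set = {w.lstrip("#").lower() for w in wanted}
    return any(tag in wanted_set for tag in extract_hashtags(record))


# Hypothetical post record with one hashtag facet.
sample_record = {
    "text": "Trying out #Apify today",
    "facets": [{"features": [{"$type": TAG_FEATURE, "tag": "Apify"}]}],
}
print(has_any_hashtag(sample_record, ["apify", "scraping"]))
```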
usernames
- Type: Array of strings
- Description: A list of Bluesky usernames to filter posts by (will be resolved to DIDs for efficient filtering)
- Behavior: The scraper will only collect posts authored by the specified users. When multiple usernames are provided, posts from ANY of these users will be included (OR logic).
- Example: If you set ["user1.bsky.social", "user2.bsky.social"], the output will include all posts from either user1 OR user2.
languages
- Type: Array of strings
- Description: Languages to filter posts by (multiple selection allowed)
- Behavior: The scraper will only collect posts in the specified languages. When multiple languages are provided, posts in ANY of these languages will be included (OR logic). If a post doesn't have a language field, the scraper can auto-detect its language (if detectLanguage is enabled).
- Example: If you set ["en", "pt"] (English and Portuguese), the output will include all posts in either English OR Portuguese.
wantedCollections
- Type: Array of strings
- Description: Specific Bluesky collections to filter from Jetstream (defaults to feed posts)
- Behavior: Controls what types of content are collected from the Bluesky firehose. Options include:
  - app.bsky.feed.post: Regular posts
  - app.bsky.feed.like: Like interactions
  - app.bsky.feed.repost: Repost interactions
  - app.bsky.graph.follow: Follow relationships
  - app.bsky.graph.block: Block relationships
  - app.bsky.actor.profile: Profile updates
- Example: If you set ["app.bsky.feed.post", "app.bsky.feed.repost"], the output will include both original posts AND reposts.
📊 Content Inclusion Parameters
includeMedia
- Type: Boolean
- Description: Whether to include URLs for media attachments
- Behavior: When set to true, the output will include media URLs from posts. When set to false, media URLs will be excluded: the mediaUrl and mediaThumbnailUrl fields will be empty, hasMedia will be false, and mediaCount will be 0.
- Example: If set to false with a language filter of ["pt"], the output will include Portuguese-language posts but without any media URLs or media-related fields populated.
includeImages
- Type: Boolean
- Description: Whether to include URLs for images in the output
- Behavior: When set to true, the output will include image URLs from posts. When set to false, image URLs will be excluded: the imageUrl field will be empty and hasImages will be false.
- Example: If set to false, posts with images will still be included in the output, but image URLs won't be extracted or included in the result fields.
includeReplies
- Type: Boolean
- Description: Whether to include reply information in collected posts
- Behavior: When set to true, the output will include information about which posts are replies, and to which posts they are replying. When set to false, this information will be excluded.
- Example: If set to true, posts that are replies will have isReply set to true, along with replyToRoot and replyToParent fields containing the URIs of the root and parent posts.
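To make the reply fields concrete, here is a minimal sketch of how isReply, replyToRoot, and replyToParent could be derived from a Jetstream commit event. It assumes the event carries the post under commit.record with an optional reply object holding root and parent references, as in the public Bluesky post schema. The function name reply_fields and the use of None for non-replies are assumptions, not the scraper's documented behavior.

```python
def reply_fields(event):
    """Derive reply-related output fields from a Jetstream commit event."""
    record = event.get("commit", {}).get("record", {})
    reply = record.get("reply")
    if not reply:
        # Assumption: non-reply posts get empty (None) reply references.
        return {"isReply": False, "replyToRoot": None, "replyToParent": None}
    return {
        "isReply": True,
        "replyToRoot": reply.get("root", {}).get("uri"),
        "replyToParent": reply.get("parent", {}).get("uri"),
    }


# Hypothetical event: a reply whose parent is itself a reply in the thread.
sample_event = {
    "did": "did:plc:example",
    "kind": "commit",
    "commit": {
        "collection": "app.bsky.feed.post",
        "record": {
            "text": "Agreed!",
            "reply": {
                "root": {"uri": "at://did:plc:aaa/app.bsky.feed.post/1"},
                "parent": {"uri": "at://did:plc:bbb/app.bsky.feed.post/2"},
            },
        },
    },
}
print(reply_fields(sample_event))
```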
🗣️ Language Settings
detectLanguage
- Type: Boolean
- Description: Whether to automatically detect the language of posts that don't specify one
- Behavior: When set to true, the scraper will use language detection to determine the language of posts that don't include language metadata. This is particularly useful when filtering by language. When set to false, posts without language metadata will not match any language filter.
- Example: If filtering for Japanese posts and this is set to true, posts without explicit language metadata might still be included if they contain Japanese text.
👤 User Profile Settings
enrichUserProfiles
- Type: Boolean
- Description: Whether to fetch additional user profile information for post authors
- Behavior: When set to true, the output will include extended information about post authors, such as their description, follower/following counts, post counts, and avatar URLs. When set to false, only basic author information (DID, handle, name) will be included.
- Example: If set to true, each post in the output will include additional fields like authorDescription, authorFollowersCount, etc.
⏱️ Data Collection Parameters
maxPosts
- Type: Integer
- Description: Maximum number of posts to collect (0 for unlimited)
- Behavior: Controls how many posts will be collected before the scraper stops. Setting to 0 means the scraper will continue until the time limit is reached.
- Example: If set to 100, the scraper will stop after collecting 100 posts that match the filter criteria.
timeLimit
- Type: Integer
- Description: Maximum time to run the scraper in minutes
- Behavior: Controls how long the scraper will run before stopping, regardless of how many posts have been collected.
- Example: If set to 30, the scraper will stop after 30 minutes, even if it hasn't reached the maxPosts limit.
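The interaction between maxPosts and timeLimit can be sketched as a single stop check: whichever limit is hit first ends the run, and maxPosts of 0 disables the post limit. The function name should_stop and its arguments are hypothetical, chosen only to illustrate the rule described above.

```python
import time


def should_stop(collected, started_at, max_posts, time_limit_minutes, now=None):
    """Return True once either the post limit or the time limit is reached."""
    if now is None:
        now = time.monotonic()
    # max_posts == 0 means "unlimited": only the time limit applies.
    if max_posts > 0 and collected >= max_posts:
        return True
    return (now - started_at) >= time_limit_minutes * 60


# 100 posts collected after 10 seconds with maxPosts=100: stop on post count.
print(should_stop(100, started_at=0.0, max_posts=100, time_limit_minutes=30, now=10.0))
```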
🔌 Connection Settings
region
- Type: String enum ("us-east" or "us-west")
- Description: Region for the Jetstream server
- Behavior: Controls which regional Bluesky Jetstream server the scraper connects to. This can affect latency and potentially the volume of data received.
- Example: If you're collecting data from the US West Coast, selecting us-west might provide lower latency.
instance
- Type: Integer (1 or 2)
- Description: Instance number for the Jetstream server
- Behavior: Selects which specific Jetstream instance to connect to within the selected region.
- Example: If experiencing connection issues with instance 1, switching to instance 2 might help.
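For orientation, the sketch below shows how region, instance, and wantedCollections could translate into a Jetstream WebSocket URL. The hostname pattern and the wantedCollections and wantedDids query parameters follow Bluesky's public Jetstream documentation at the time of writing. Whether this Actor builds its URL exactly this way is an assumption, and jetstream_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def jetstream_url(region="us-east", instance=1,
                  wanted_collections=("app.bsky.feed.post",),
                  wanted_dids=()):
    """Build a subscribe URL for one of the public Jetstream instances."""
    if region not in ("us-east", "us-west"):
        raise ValueError(f"unsupported region: {region}")
    if instance not in (1, 2):
        raise ValueError(f"unsupported instance: {instance}")
    host = f"jetstream{instance}.{region}.bsky.network"
    # Repeated query keys express multiple collections or DIDs.
    params = [("wantedCollections", c) for c in wanted_collections]
    params += [("wantedDids", d) for d in wanted_dids]
    query = urlencode(params)
    return f"wss://{host}/subscribe" + (f"?{query}" if query else "")


print(jetstream_url("us-west", 2))
```

Switching from instance 1 to instance 2 only changes the hostname, so the same filters carry over when working around connection issues.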
autoReconnect
- Type: Boolean
- Description: Whether to automatically reconnect if the connection is lost
- Behavior: When set to true, the scraper will attempt to reconnect to Jetstream if the connection drops. When set to false, the scraper will terminate on connection loss.
- Example: For long-running data collection jobs, setting this to true helps ensure continuous data collection despite temporary network issues.
maxRetries
- Type: Integer
- Description: Maximum number of reconnection attempts
- Behavior: Controls how many times the scraper will try to reconnect before giving up.
- Example: If set to 5, the scraper will make up to 5 reconnection attempts before terminating.
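The way autoReconnect and maxRetries work together can be pictured as a bounded retry loop. This is a simplified sketch, not the scraper's implementation: run_with_reconnect and the connect callable are hypothetical, a dropped connection is modeled as a ConnectionError, and a real client would likely also wait between attempts.

```python
def run_with_reconnect(connect, auto_reconnect=True, max_retries=5):
    """Call connect() and retry on ConnectionError up to max_retries times."""
    retries = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            # Give up immediately if reconnecting is disabled,
            # or once the retry budget is used up.
            if not auto_reconnect or retries >= max_retries:
                raise
            retries += 1


# Hypothetical connection that drops twice before succeeding.
attempts = []

def flaky_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("connection dropped")
    return "stream finished"

print(run_with_reconnect(flaky_connect, max_retries=5))
```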
⚙️ Advanced Settings
saveCheckpoints
- Type: Boolean
- Description: Whether to periodically save collected data to prevent loss on errors
- Behavior: When set to true, the scraper will periodically save collected data to disk, allowing recovery from a checkpoint if the process is interrupted.
- Example: If set to true and the scraper crashes after collecting 400 posts, you might be able to recover 350 of them from the last checkpoint.
proxy
- Type: Object
- Description: Proxy configuration for the scraper
- Behavior: Controls whether and how the scraper uses Apify proxies for connections.
- Example: Setting useApifyProxy to true allows the scraper to use Apify's proxy infrastructure, which can help avoid rate limiting.
debugMode
- Type: Boolean
- Description: Whether to enable detailed logging for troubleshooting
- Behavior: When set to true, the scraper will output more detailed logs about its operation, which can help diagnose issues.
- Example: If you're not seeing the expected output, setting this to true can provide insights into what's happening.
verboseDebug
- Type: Boolean
- Description: Whether to enable extremely detailed logging for message format diagnostics
- Behavior: When set to true, the scraper will output extremely detailed logs, including raw message contents. This generates large log files.
- Example: Useful only for advanced debugging when developing or modifying the scraper.
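Putting the parameters above together, a run input might look like the following. The keys are the parameter names documented in this section. Every value is illustrative only and should not be read as the Actor's defaults, which are not stated here.

```python
import json

# Illustrative run input; values are examples, not documented defaults.
run_input = {
    "hashtags": ["apify", "scraping"],
    "usernames": [],
    "languages": ["en", "pt"],
    "wantedCollections": ["app.bsky.feed.post"],
    "includeMedia": True,
    "includeImages": True,
    "includeReplies": True,
    "detectLanguage": True,
    "enrichUserProfiles": False,
    "maxPosts": 100,
    "timeLimit": 30,
    "region": "us-east",
    "instance": 1,
    "autoReconnect": True,
    "maxRetries": 5,
    "saveCheckpoints": True,
    "proxy": {"useApifyProxy": True},
    "debugMode": False,
    "verboseDebug": False,
}

# Serialize to the JSON form accepted by the Actor's input editor.
print(json.dumps(run_input, indent=2))
```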
🎨 Customizing Output Format
The scraper allows you to customize the data fields included in the output through several parameters:
Field Selection Controls
These parameters control which data fields are included in the output:
- includeMedia: Controls whether media URLs and related fields are included
- includeImages: Controls whether image URLs and related fields are included
- includeReplies: Controls whether reply information fields are included
- enrichUserProfiles: Controls whether extended author profile fields are included
Output Format Options
On the Apify platform, you can download your dataset in several formats:
- JSON: The default format with complete data structure
- CSV: Tabular format suitable for spreadsheet applications
- Excel: Direct Excel file download
- RSS: For feed readers
- HTML: For web viewing
To change the download format:
- Navigate to the "Storage" tab in your Apify account
- Select the dataset from your actor run
- Click the "Download" dropdown menu
- Choose your preferred format
For customized data processing, you can also use the Apify API to retrieve the data programmatically in your preferred format.
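As a small sketch of programmatic retrieval, the helper below builds a download URL for the Apify API's dataset items endpoint with a format parameter. The dataset ID shown is a placeholder and dataset_items_url is a hypothetical helper. Private datasets additionally require an API token, which is left out here.

```python
from urllib.parse import urlencode

# Formats listed above, using the API's identifiers (Excel is "xlsx").
SUPPORTED_FORMATS = {"json", "csv", "xlsx", "rss", "html"}


def dataset_items_url(dataset_id, fmt="json", limit=None):
    """Build a URL for downloading dataset items in the chosen format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"format": fmt}
    if limit is not None:
        params["limit"] = limit
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"


# "YOUR_DATASET_ID" is a placeholder for the dataset of your Actor run.
print(dataset_items_url("YOUR_DATASET_ID", "csv"))
```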
🔄 Combining Filters
When multiple filter types are used together (hashtags, usernames, languages), the scraper applies AND logic between different filter types:
- If you set both hashtags and languages, posts must match BOTH criteria (contain one of the hashtags AND be in one of the languages).
- If you set both usernames and languages, posts must be authored by one of the specified users AND be in one of the specified languages.
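The combined rule (OR within a filter type, AND between filter types, and an empty filter matching everything) can be sketched as follows. The post dictionary shape with handle, hashtags, and langs keys is hypothetical and used only for illustration. The case-insensitive comparison is likewise an assumption.

```python
def matches(post, hashtags=(), usernames=(), languages=()):
    """OR within each filter type, AND across filter types."""

    def any_match(wanted, actual):
        wanted_set = {w.lower() for w in wanted}
        # An empty filter places no restriction on the post.
        if not wanted_set:
            return True
        return bool(wanted_set & {a.lower() for a in actual})

    return (
        any_match(hashtags, post.get("hashtags", []))
        and any_match(usernames, [post.get("handle", "")])
        and any_match(languages, post.get("langs", []))
    )


# Hypothetical normalized post.
post = {"handle": "user1.bsky.social", "hashtags": ["apify"], "langs": ["en"]}

print(matches(post, hashtags=["apify", "scraping"], languages=["en", "pt"]))
print(matches(post, hashtags=["apify"], languages=["pt"]))
```

The first call matches because the post satisfies both filter types. The second fails because the hashtag matches but the language does not.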
⚪ Default Behavior (No Filters)
When no filter options (hashtags, usernames, languages) are selected:
- The scraper will collect all posts from the Bluesky Jetstream without any filtering
- All posts will match the filter criteria automatically
- The only limits will be the maxPosts parameter and/or the timeLimit parameter
- You'll get a diverse, unfiltered stream of Bluesky content
- Other inclusion settings like includeMedia and includeImages will still be applied
- Collection types will be limited to what's specified in wantedCollections (defaults to feed posts)
This approach is useful for general data collection when you want to analyze the overall Bluesky content without focusing on specific topics, users, or languages.
📝 Example Scenarios
🤝 Bluesky Firehose Scraping Etiquette
When using the Bluesky Jetstream (firehose), it's important to follow these ethical guidelines and best practices:
📜 Official Guidelines
- Respect the Terms of Service: Always adhere to Bluesky's official Terms of Service and API Usage Guidelines.
- Attribution: When publishing research or analysis based on Bluesky data, properly attribute the source.
- Privacy Awareness: Though the data is publicly available, be mindful that users may not expect their content to be analyzed at scale.
🔧 Technical Best Practices
- Rate Limiting: The scraper already implements rate limiting, but be cautious about running multiple instances simultaneously.
- Efficient Filtering: Use the filtering options to collect only the data you need rather than scraping everything.
- Connection Management: Use the autoReconnect and maxRetries settings responsibly to avoid creating excessive connection attempts.
- Data Storage: Handle collected data securely and in compliance with relevant privacy regulations like GDPR.
🔍 Responsible Usage
- Research Purpose: Clearly define your research or business purpose before collecting data.
- Minimize Collection: Only collect the data fields necessary for your analysis.
- Respect Boundaries: Avoid excessive scraping that might impact the platform's performance.
- Consider Opt-Out: When presenting results, consider providing ways for users to opt-out of having their content included.
⚖️ Legal Considerations
- Data Protection: Comply with applicable data protection laws in your jurisdiction.
- User Privacy: Even though posts are public, respect user privacy by anonymizing data when possible.
- Terms Changes: Regularly check for updates to Bluesky's terms as the platform is evolving.
Following these guidelines ensures ethical use of the Bluesky firehose while maintaining a positive relationship with the platform and its community.
Pricing
Pricing model
Rental
To use this Actor, you have to pay a monthly rental fee to the developer. The rent is subtracted from your prepaid usage every month after the free trial period. You also pay for the Apify platform usage.
Free trial
1 day
Price
$25.00