Merge, Dedup & Transform Datasets

Pricing

Pay per usage

Developed by

Lukáš Křivka

Maintained by Community

The ultimate dataset processor. Extremely fast merging, deduplications & transformations all in a single run.

Monthly users: 77

Runs succeeded: 97%

Response time: 20 days

Last modified: 3 months ago

Dedup Actor in dedup-as-loading mode reprocesses all historical runs every time

Open

segalmax opened this issue 4 days ago

I'm encountering unexpected behavior when using the dedup actor (mode: dedup-as-loading) with the configuration below. Every time the actor runs after my scraper finishes, it reprocesses and reloads all historical dataset items, even those that were already deduplicated in previous executions. As a result, each time the dedup actor is triggered, the output dataset is populated with records from all previous runs rather than only the new unique records.

Configuration:

```json
{
    "actorOrTaskId": "segalmax/fb-groups-scraper-task",
    "appendDatasetIds": false,
    "fields": ["url"],
    "mode": "dedup-as-loading",
    "nullAsUnique": false,
    "outputDatasetId": "deduped-facebook-posts",
    "postDedupTransformFunction": "async (items, { Apify }) => {\n return items;\n}",
    "preDedupTransformFunction": "async (items, { Apify }) => {\n return items;\n}",
    "verboseLog": false
}
```

Observed Behavior:

The dedup actor loads around 75 runs (each containing small datasets—most with 5 items) every time it is triggered.

The log shows that the dedup set starts at zero unique keys for each execution, and then it processes items from all historical runs.

As a result, even though within each single execution the deduplication removes duplicates, the actor does not maintain a persistent state between executions. This causes reprocessing of historical data on every run.

I observe log messages such as:

```
INFO  Loaded 75 runs (...)
INFO  Loaded deduplicating set, currently contains 0 unique keys (already deduplicated items)
INFO  Dataset [dataset_id] has 5 items
```

These messages clearly indicate that the dedup actor rebuilds its state from scratch each time.

Expected Behavior:

I expected the dedup actor to persist or recognize already-processed dataset IDs across runs, so that subsequent executions only add records from new runs instead of reprocessing historical runs. Alternatively, if the intention is that the dedup actor processes only the new dataset provided (by, for example, setting appendDatasetIds to true or filtering for new runs), the current behavior (with appendDatasetIds: false) is not meeting that goal.
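One client-side workaround for this could be to track which run IDs have already been handed to the dedup actor and pass only new datasets on each trigger. Below is a minimal sketch of the selection logic in plain JavaScript; how the processed-ID list is persisted (for example in a named Apify key-value store) and the actual actor call are assumptions and not shown:

```javascript
// Given all runs returned by the platform and a previously persisted
// array of already-processed run IDs, keep only the runs that are new.
function selectNewRuns(allRuns, processedIds) {
    const processed = new Set(processedIds);
    return allRuns.filter((run) => !processed.has(run.id));
}

// After handing the new runs to the dedup actor, merge their IDs back
// into the persisted list so the next trigger skips them.
function markProcessed(processedIds, newRuns) {
    return [...new Set([...processedIds, ...newRuns.map((r) => r.id)])];
}
```

With something like this in front of the actor call, the dedup actor would only ever see datasets it has not processed before, regardless of the appendDatasetIds setting.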

Steps to Reproduce:

Run the Facebook scraper task (segalmax/fb-groups-scraper-task) multiple times to accumulate several runs (each producing a dataset with around 5 items).

Trigger the dedup actor with the above configuration after the scraper finishes.

Observe in the log that the actor:

Loads all historical runs (about 75).

Rebuilds the deduplication state starting from zero unique keys.

Reprocesses datasets from all runs, leading to re-adding records that have already been deduplicated.

The output dataset (deduped-facebook-posts) ends up with the records from all runs being processed again.

Environment Details:

Apify version: 2.3.2

Apify client version: 2.9.5

Node version: v16.20.2

Dedup actor mode: dedup-as-loading

Additional Notes:

I understand that appendDatasetIds: false makes the actor load all dataset IDs each time. However, I intended to have an incremental deduplication process where only new unique records (or only new runs) are processed.

If there is a recommended way to configure the actor for incremental deduplication, please advise. Otherwise, it seems there might be a bug or missing feature regarding state persistence between executions.
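For reference, the within-batch effect of deduplicating on fields: ["url"] can be sketched in plain JavaScript; this is only an illustration of the expected semantics, not the actor's actual implementation:

```javascript
// Keep only the first item for each distinct value of the "url" field,
// mirroring what deduplication on fields: ["url"] does within one batch.
function dedupByUrl(items) {
    const seen = new Set();
    return items.filter((item) => {
        if (seen.has(item.url)) return false;
        seen.add(item.url);
        return true;
    });
}
```

The problem described above is that this seen set effectively starts empty on every execution; an incremental setup would need it (or the list of already-processed runs) persisted between executions.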

Please let me know if any further information or logs are required. Thanks for your help!

Pricing

Pricing model: Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.