Merge, Dedup & Transform Datasets

lukaskrivka/dedup-datasets
The ultimate dataset processor. Extremely fast merging, deduplications & transformations all in a single run.

Merge and Dedup to a single dataset

Open

graphext opened this issue a month ago

Hi! This is more of a feature request (or maybe help in case it is already possible) to perform merging and deduplication to a single dataset.

Lemme illustrate this with an example. Say I have a schedule that runs a defined Task every 2 minutes. This schedule will run the task and create a small, unnamed dataset every 2 minutes.

Let's say that I want to merge all these small datasets that get created into one single, well-defined, named dataset that I created myself for this purpose. Its ID will be set in the "outputDatasetId".

Using your beloved tool, I pass in the {{resource.datsetId}} to the input, AS WELL as the big, named dataset. I also pass a "url" parameter to the "fields", to perform deduplication. I use webhooks to run this every time the run is successful.

This will perform deduplication, contrasting the small, new dataset against the old, big dataset. This difference will then be pushed into the old dataset. The problem is that this creates duplicates again and again in the old dataset. Imagine that the big, named dataset and the new, small dataset have no items in common. That means the "difference" dataset will be the union of both sets. This will be pushed to the big dataset, creating duplicates of all already existing items, as well as adding the small number of new items. I hope I am getting my point across, but feel free to ask.
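To make the failure mode concrete, here is a minimal in-memory sketch in Python. It does not call the Actor or the Apify API; the helper names `dedup_by_field` and `merge_and_push` are hypothetical, and it assumes that pushing to a dataset only appends items.

```python
def dedup_by_field(items, field):
    """Keep the first item seen for each value of `field`."""
    seen = set()
    unique = []
    for item in items:
        key = item[field]
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def merge_and_push(big, small, field):
    """Dedup the union of both datasets, then append the result to `big`.

    Assumption: pushing appends to the output dataset and never replaces it.
    """
    result = dedup_by_field(big + small, field)
    return big + result


# The big, named dataset and the small, new one share no items.
big = [{"url": "a"}, {"url": "b"}]
small = [{"url": "c"}]

after_push = merge_and_push(big, small, "url")
urls = [item["url"] for item in after_push]
print(urls)  # ['a', 'b', 'a', 'b', 'c']
```

Every existing item ends up in the big dataset twice, and each further run would add another copy.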

Essentially, this would not happen if we were able to either:

  • perform deduplication against the outputDataset
  • or, instead of pushing to the dataset, completely overwrite it.
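The two options above can be sketched in plain Python. This is only an illustration of the requested behaviour under the assumption that items are dicts keyed by a single field; the function names `push_only_new` and `overwrite_with_merged` are hypothetical and are not part of the Actor.

```python
def push_only_new(output, incoming, field):
    """Option 1: dedup against the output dataset, append only unseen items."""
    seen = {item[field] for item in output}
    new_items = []
    for item in incoming:
        if item[field] not in seen:
            seen.add(item[field])
            new_items.append(item)
    return output + new_items


def overwrite_with_merged(output, incoming, field):
    """Option 2: rebuild the whole dataset from the deduplicated union."""
    merged = {}
    for item in output + incoming:
        # setdefault keeps the first item seen for each key
        merged.setdefault(item[field], item)
    return list(merged.values())


output = [{"url": "a"}, {"url": "b"}]
incoming = [{"url": "b"}, {"url": "c"}]

appended = push_only_new(output, incoming, "url")
replaced = overwrite_with_merged(output, incoming, "url")
print([item["url"] for item in appended])  # ['a', 'b', 'c']
print([item["url"] for item in replaced])  # ['a', 'b', 'c']
```

Both variants leave the named dataset with one copy of each item. The first only writes the new items, which matters when the named dataset is large.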

So, here I ask: is there a way of achieving what I am trying to do with the current state of your tool? Which would essentially mean I'm using it incorrectly.

Or, in the case it's not possible, is there any chance we can implement any of these methods?

That's it! Sorry for the long message, I hope it explains the problem well, though.

Thanks so much in advance for your time and for your amazing tool.

Have a great day!

—Jesús

paja

Hi,

thanks for reaching out, we'll look into it and let you know what can be done.

Developer
Maintained by Apify
Actor metrics
  • 831 monthly users
  • 49 stars
  • 99.8% runs succeeded
  • 3.6 days response time
  • Created in Apr 2020
  • Modified 9 days ago