OpenAI Vector Store Integration

jiri.spilka/openai-vector-store-integration

The Apify OpenAI Vector Store integration uploads data from Apify Actors to the OpenAI Vector Store linked to OpenAI Assistant.

Error due to exceeding 'file_ids' array length limit when processing large datasets

Closed

cankat opened this issue
2 months ago

Hello,

I've been using the OpenAI Vector Store Integration actor to process a large dataset of scraped pages from a previous Apify run. The dataset contains over 9,000 items, but when the actor attempts to create the vector store batch, it encounters an error due to exceeding the maximum allowed file_ids array length.

It appears that the actor tries to send all the file_ids in a single request to OpenAI's API, exceeding the limit of 500 file_ids per request. As a result, the actor fails to process the entire dataset, and only a portion of the data is inserted into the vector store.

To reproduce the issue, run a web scraping task that collects a large number of results (e.g., over 9,000 items), then use the OpenAI Vector Store Integration actor to process the dataset. The actor fails with the error described above.

The actor should handle large datasets by batching the file_ids into chunks of 500 or fewer when making the create_and_poll API requests to OpenAI. This would comply with OpenAI's API limitations and allow processing of large datasets without encountering the "array too long" error.
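The batching suggested above could look roughly like the following. This is only a minimal sketch, not the Actor's actual code: the helper name `chunk_file_ids` and the sample IDs are made up for illustration, and the 500 limit is the one reported in this issue.

```python
from typing import Iterator, List, Sequence

# Limit reported in this issue; treat it as an assumption, not a documented constant.
MAX_FILE_IDS_PER_BATCH = 500


def chunk_file_ids(
    file_ids: Sequence[str], size: int = MAX_FILE_IDS_PER_BATCH
) -> Iterator[List[str]]:
    """Yield consecutive slices of file_ids, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(file_ids), size):
        yield list(file_ids[start:start + size])


# Hypothetical file IDs standing in for the 9,000+ uploaded files.
file_ids = [f"file-{i}" for i in range(9000)]
batches = list(chunk_file_ids(file_ids))
print(len(batches), len(batches[0]), len(batches[-1]))
```

Each yielded batch would then be passed to a separate create_and_poll request instead of sending all file_ids at once.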

Run URL: https://console.apify.com/actors/runs/paRZQeERG1bHj3CqQ

Run URL Log: https://api.apify.com/v2/logs/paRZQeERG1bHj3CqQ

jiri.spilka

Hi, thank you for using the OpenAI Integration!

And thank you for the excellent explanations and examples—they were very helpful in quickly identifying the issue.

The fix is straightforward. I’ll bundle it with a few other changes I’ve been planning, test it, and release it tomorrow.

jiri.spilka

Hi, I’ve implemented the changes, and the Actor can now handle batch operations. However, during testing on my crawl, the files were created, but attaching them to the vector store failed without a clear reason.

Here’s my log:

VectorStoreFileBatch(id='vsfb_d3d0bdb8cf1f4a8987514d91b1208e84', created_at=1732652119, file_counts=FileCounts(cancelled=0, completed=203, failed=297)

I’m afraid you might encounter a similar issue. I’ll need to investigate this further. My apologies for the inconvenience.

jiri.spilka

Hi, thank you again for pointing out this issue.

I was able to fix it. The problem was that OpenAI doesn’t handle large batches of files well. I had to reduce the batch size to 100 to avoid many failures.

However, I still couldn’t determine the exact reason for some of the failures. To address this, I modified the code to upload files to the vector store one by one. It turns out that OpenAI cannot process PDF files that are represented as images (e.g., scanned PDFs). Only text-based PDFs can be added to the vector store.

In the latest version (0.0.38), you now get detailed output that allows you to examine which files failed to upload to the vector store. The trade-off is that the upload to the OpenAI Vector Store is now slower than before.
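To illustrate the one-by-one approach with per-file reporting, here is a rough sketch. It is not the code from version 0.0.38: `attach_one_by_one` and `fake_attach` are hypothetical names, and `attach` is an assumed callable that would wrap whatever OpenAI SDK call attaches a single file to the vector store and returns its final status.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def attach_one_by_one(
    file_ids: Iterable[str], attach: Callable[[str], str]
) -> Dict[str, list]:
    """Attach files individually and record which ones failed and why.

    `attach` is assumed to return a status string such as "completed" or
    "failed", or to raise an exception on an API error.
    """
    attached: List[str] = []
    failed: List[Tuple[str, str]] = []
    for file_id in file_ids:
        try:
            status = attach(file_id)
        except Exception as exc:  # keep going so one bad file does not stop the run
            failed.append((file_id, str(exc)))
            continue
        if status == "completed":
            attached.append(file_id)
        else:
            failed.append((file_id, status))
    return {"attached": attached, "failed": failed}


def fake_attach(file_id: str) -> str:
    """Stand-in for the real API call: pretend scanned PDFs cannot be processed."""
    return "failed" if file_id.endswith("-scanned") else "completed"


report = attach_one_by_one(["file-1", "file-2-scanned", "file-3"], fake_attach)
print(report)
```

Processing files individually costs more requests, which matches the slower upload noted above, but the report makes it possible to see exactly which files were rejected.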

I hope this helps.

I’ll go ahead and close this issue. If you encounter any further difficulties or have additional questions, feel free to reach out.

Developer
Maintained by Apify

Actor Metrics

  • 15 monthly users

  • 9 stars

  • 92% runs succeeded

  • 2.8 days response time

  • Created in Apr 2024

  • Modified 2 months ago