PDF Text Extractor


jirimoravcik/pdf-text-extractor
PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.

Data Format Error

Closed

Regscanner opened this issue
a year ago

When crawling I receive the following error:

2024-02-14T06:07:31.285Z ACTOR: Pulling Docker image of build 5lFVfc3pf7JN70PcE from repository.
2024-02-14T06:07:35.343Z ACTOR: Creating Docker container.
2024-02-14T06:07:35.641Z ACTOR: Starting Docker container.
2024-02-14T06:07:37.360Z INFO Initializing actor...
2024-02-14T06:07:37.363Z INFO System info ({"apify_sdk_version": "1.1.5", "apify_client_version": "1.4.1", "python_version": "3.11.7", "os": "linux"})
2024-02-14T06:07:37.628Z --- Logging error ---
2024-02-14T06:07:37.629Z Traceback (most recent call last):
2024-02-14T06:07:37.631Z   File "/usr/src/app/src/main.py", line 15, in main
2024-02-14T06:07:37.633Z     pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-02-14T06:07:37.634Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.636Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-02-14T06:07:37.638Z     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-02-14T06:07:37.639Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.641Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-02-14T06:07:37.642Z     raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-02-14T06:07:37.644Z pypdfium2._helpers.misc.Pdfi... [trimmed]

jirimoravcik

Hi, sadly it seems that PDFium has problems parsing the PDF file you provided. Can you try some other files to see if it is caused by that specific file?
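
One way to narrow this down before trying other files is to check what was actually downloaded. The sketch below is not the actor's code. It assumes (the issue does not confirm this) that a possible cause is a response body that is not a PDF at all, for example an HTML error page. The helper names `looks_like_pdf`, `describe_payload`, and `try_open` are hypothetical; only `pdfium.PdfDocument(io.BytesIO(...))` and `PdfiumError` are taken from the traceback above.

```python
import io

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears in the first 1024 bytes."""
    return PDF_MAGIC in data[:1024]


def describe_payload(data: bytes) -> str:
    """Give a rough, human-readable guess at what the downloaded bytes are."""
    if not data:
        return "empty response body"
    if looks_like_pdf(data):
        return "has a PDF header"
    head = data[:256].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "looks like an HTML page, not a PDF"
    return "unknown format (no %PDF- header found)"


def try_open(data: bytes):
    """Try to open the bytes with pypdfium2.

    Returns (document, None) on success or (None, reason) on failure.
    """
    # Imported lazily so the checks above can run without pypdfium2 installed.
    import pypdfium2 as pdfium

    if not looks_like_pdf(data):
        return None, describe_payload(data)
    try:
        return pdfium.PdfDocument(io.BytesIO(data)), None
    except pdfium.PdfiumError as exc:
        # The header is present but PDFium still rejects the file,
        # which points at a damaged or unsupported PDF.
        return None, f"PDFium could not parse the file: {exc}"
```

If `describe_payload` reports something other than a PDF header, the problem is likely in the download step (wrong URL, redirect, or login page) rather than in PDFium itself. If the header is present and `try_open` still fails, the specific file is the more likely culprit.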

Developer
Maintained by Community

Actor Metrics

  • 40 monthly users

  • 20 stars

  • >99% runs succeeded

  • Created in Oct 2023

  • Modified 4 months ago