PDF Text Extractor

jirimoravcik/pdf-text-extractor

PDF Text Extractor extracts text from PDF files. It can also split the extracted text into chunks to prepare the data for use with large language models.

Data Format Error

Closed

Regscanner opened this issue
7 months ago

When crawling, I receive the following error:

2024-02-14T06:07:31.285Z ACTOR: Pulling Docker image of build 5lFVfc3pf7JN70PcE from repository.
2024-02-14T06:07:35.343Z ACTOR: Creating Docker container.
2024-02-14T06:07:35.641Z ACTOR: Starting Docker container.
2024-02-14T06:07:37.360Z INFO Initializing actor...
2024-02-14T06:07:37.363Z INFO System info ({"apify_sdk_version": "1.1.5", "apify_client_version": "1.4.1", "python_version": "3.11.7", "os": "linux"})
2024-02-14T06:07:37.628Z --- Logging error ---
2024-02-14T06:07:37.629Z Traceback (most recent call last):
2024-02-14T06:07:37.631Z   File "/usr/src/app/src/main.py", line 15, in main
2024-02-14T06:07:37.633Z     pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-02-14T06:07:37.634Z                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.636Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-02-14T06:07:37.638Z     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-02-14T06:07:37.639Z                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.641Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-02-14T06:07:37.642Z     raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-02-14T06:07:37.644Z pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).
2024-02-14T06:07:37.646Z
2024-02-14T06:07:37.647Z During handling of the above exception, another exception occurred:
2024-02-14T06:07:37.649Z
2024-02-14T06:07:37.650Z Traceback (most recent call last):
2024-02-14T06:07:37.652Z   File "/usr/local/lib/python3.11/logging/__init__.py", line 1110, in emit
2024-02-14T06:07:37.654Z     msg = self.format(record)
2024-02-14T06:07:37.656Z           ^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.657Z   File "/usr/local/lib/python3.11/logging/__init__.py", line 953, in format
2024-02-14T06:07:37.659Z     return fmt.format(record)
2024-02-14T06:07:37.661Z            ^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.663Z   File "/usr/local/lib/python3.11/logging/__init__.py", line 687, in format
2024-02-14T06:07:37.665Z     record.message = record.getMessage()
2024-02-14T06:07:37.672Z                      ^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.674Z   File "/usr/local/lib/python3.11/logging/__init__.py", line 377, in getMessage
2024-02-14T06:07:37.675Z     msg = msg % self.args
2024-02-14T06:07:37.678Z           ~~~~^~~~~~~~~~~
2024-02-14T06:07:37.680Z TypeError: not all arguments converted during string formatting
2024-02-14T06:07:37.682Z Call stack:
2024-02-14T06:07:37.683Z   File "<frozen runpy>", line 198, in _run_module_as_main
2024-02-14T06:07:37.685Z   File "<frozen runpy>", line 88, in _run_code
2024-02-14T06:07:37.687Z   File "/usr/src/app/src/main.py", line 20, in <module>
2024-02-14T06:07:37.688Z     asyncio.run(main())
2024-02-14T06:07:37.690Z   File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
2024-02-14T06:07:37.692Z     return runner.run(main)
2024-02-14T06:07:37.693Z   File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
2024-02-14T06:07:37.695Z     return self._loop.run_until_complete(task)
2024-02-14T06:07:37.697Z   File "/usr/local/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
2024-02-14T06:07:37.698Z     self.run_forever()
2024-02-14T06:07:37.700Z   File "/usr/local/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
2024-02-14T06:07:37.702Z     self._run_once()
2024-02-14T06:07:37.703Z   File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once
2024-02-14T06:07:37.705Z     handle._run()
2024-02-14T06:07:37.707Z   File "/usr/local/lib/python3.11/asyncio/events.py", line 80, in _run
2024-02-14T06:07:37.708Z     self._context.run(self._callback, *self._args)
2024-02-14T06:07:37.710Z   File "/usr/src/app/src/main.py", line 36, in main
2024-02-14T06:07:37.712Z     logging.error(f'Could not process URL {url} due to exception', e)
2024-02-14T06:07:37.714Z Message: 'Could not process URL https://legiscan.com/OK/text/SB1401/id/2927028/Oklahoma-2024-SB1401-Amended.pdf due to exception'
2024-02-14T06:07:37.716Z Arguments: (PdfiumError('Failed to load document (PDFium: Data format error).'),)
2024-02-14T06:07:37.717Z INFO Exiting actor ({"exit_code": 0})
2024-02-14T06:07:37.719Z INFO:apify:Exiting actor
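
A side note on the second traceback in the log: the `TypeError` appears to come from the `logging.error(...)` call at `main.py` line 36, which passes the exception `e` as an extra positional argument while the message contains no `%s` placeholder for it. The following is a minimal sketch of a call that avoids this, assuming only the standard-library `logging` module. The logger name and the `log_failure` helper are hypothetical and not part of the actor's code.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("pdf_text_extractor_example")


def log_failure(url: str, exc: Exception) -> None:
    """Log a failed URL together with the exception text."""
    # %-style placeholders: logging substitutes the arguments lazily,
    # so there must be one placeholder per extra argument.
    logger.error("Could not process URL %s due to exception: %s", url, exc)
```

Inside an `except` block, `logger.exception("Could not process URL %s", url)` is an alternative that also records the full traceback.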

jirimoravcik

Hi, sadly it seems that PDFium has problems parsing the PDF file you provided. Can you try some other files to see if it is caused by that specific file?
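
One way to narrow this down, sketched under assumptions: PDFium's format error generally means the input is either not a PDF or is corrupted, which can happen when a server answers with an HTML page instead of the file. The check below looks for the `%PDF-` header marker near the start of the downloaded bytes before they are handed to `pdfium.PdfDocument`. The helper name `looks_like_pdf` and the 1024-byte window are assumptions made for illustration, and it is not confirmed that this is what happened with the URL in the report.

```python
def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of data."""
    # A PDF normally starts with "%PDF-"; some files carry a few junk
    # bytes first, so scan a small leading window rather than offset 0 only.
    return b"%PDF-" in data[:search_window]


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))       # PDF-like header
    print(looks_like_pdf(b"<!DOCTYPE html><html>...</html>"))   # HTML response
```

If the check fails, logging the response's status code and `Content-Type` header alongside the URL would show what the server actually returned.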

Developer
Maintained by Community
Actor metrics
  • 38 monthly users
  • 14 stars
  • 100.0% runs succeeded
  • 7.4 hours response time
  • Created in Oct 2023
  • Modified 5 months ago