lucene-solr-user mailing list archives

From Rick Leir <>
Subject Re: Using Tesseract OCR to extract PDF files in EML file attachment
Date Tue, 04 Apr 2017 06:19:59 GMT
Tesseract probably knows nothing about the EML format. Your scripts could pull the EML files apart and hand the PDF attachments to the OCR step you already use for standalone PDFs.
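
A minimal sketch of such a script, assuming Python 3.6+ and only its standard `email` package. The function name `extract_pdf_attachments` and the output directory are made up for illustration; nothing here is part of Solr or Tesseract. It walks the MIME parts of one EML file and writes out every part that looks like a PDF.

```python
import email
from email import policy
from pathlib import Path


def extract_pdf_attachments(eml_path, out_dir):
    """Save every PDF part found in an EML file and return the saved paths.

    Hypothetical helper for illustration: it only separates the PDFs from
    the message; OCR and indexing are left to the existing pipeline.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Parse the raw message; policy.default gives the modern EmailMessage API.
    with open(eml_path, "rb") as fh:
        msg = email.message_from_binary_file(fh, policy=policy.default)

    saved = []
    for part in msg.walk():
        # Container parts (multipart/*, message/rfc822) hold no file data.
        if part.is_multipart():
            continue
        filename = part.get_filename() or ""
        is_pdf = (part.get_content_type() == "application/pdf"
                  or filename.lower().endswith(".pdf"))
        if not is_pdf:
            continue
        # Drop any directory components a sender may have put in the name.
        safe_name = Path(filename).name or f"attachment-{len(saved)}.pdf"
        target = out_dir / safe_name
        # decode=True undoes the base64 / quoted-printable transfer encoding.
        target.write_bytes(part.get_payload(decode=True))
        saved.append(target)
    return saved
```

The saved files are ordinary PDFs, so they can presumably go through whatever route already works for your scanned PDFs. Attachments that share a filename would overwrite each other in this sketch.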

On April 4, 2017 2:00:19 AM EDT, Zheng Lin Edwin Yeo <> wrote:
>Currently, I am able to extract scanned PDF images and index them to Solr
>using Tesseract OCR, although the speed is very slow.
>However, for EML files with PDF attachments that consist of scanned images,
>the Tesseract OCR is not able to extract the text from those PDF attachments.
>Can we use the same method for EML files? Or what are the suggestions for what
>we can do to extract those attachments?
>I'm using Solr 6.5.0

Sent from my Android device with K-9 Mail. Please excuse my brevity.