lucene-solr-user mailing list archives

From Rick Leir <rl...@leirtech.com>
Subject Re: Using Tesseract OCR to extract PDF files in EML file attachment
Date Tue, 04 Apr 2017 06:19:59 GMT
Tesseract probably knows nothing of the EML format. Your scripts could pull the EML files apart and hand the PDF attachments to the OCR step separately.
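
A minimal sketch of such a script, using only Python's standard `email` package. The function name `extract_pdf_attachments` is made up for illustration, and the check on the file name is an assumption to catch mailers that label PDFs as `application/octet-stream`. It only looks at top-level attachments, so nested forwarded messages would need extra handling.

```python
from email import policy
from email.parser import BytesParser


def extract_pdf_attachments(eml_bytes):
    """Return a list of (filename, pdf_bytes) for PDF attachments in one EML message."""
    # policy.default gives us the modern EmailMessage API (iter_attachments etc.)
    msg = BytesParser(policy=policy.default).parsebytes(eml_bytes)
    pdfs = []
    for part in msg.iter_attachments():
        filename = part.get_filename() or ""
        is_pdf = (
            part.get_content_type() == "application/pdf"
            or filename.lower().endswith(".pdf")  # assumption: trust the extension too
        )
        if not is_pdf:
            continue
        # decode=True undoes the base64 / quoted-printable transfer encoding
        data = part.get_payload(decode=True)
        # Fallback name is hypothetical, used only when the mailer gave none
        pdfs.append((filename or "attachment-%d.pdf" % len(pdfs), data))
    return pdfs
```

Each returned PDF could then be written to disk and fed through whatever Tesseract pipeline already works for standalone scanned PDFs.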

On April 4, 2017 2:00:19 AM EDT, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>Hi,
>
>Currently, I am able to extract scanned PDF images and index them to Solr
>using Tesseract OCR, although the speed is very slow.
>
>However, for EML files with PDF attachments that consist of scanned images,
>the Tesseract OCR is not able to extract the text from those PDF attachments.
>
>Can we use the same method for EML files? If not, what would you suggest
>we do to extract those attachments?
>
>I'm using Solr 6.5.0
>
>Regards,
>Edwin

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.