lucene-solr-user mailing list archives

From AJ Weber <awe...@comcast.net>
Subject Re: Using Tesseract OCR to extract PDF files in EML file attachment
Date Tue, 04 Apr 2017 13:12:36 GMT
You'll need to use something like JavaMail (javax.mail), or one of the
libraries built on top of it for higher-level access, to open the EML
files and extract the attachments. You can then process the extracted
attachments as you would any other file.

There are also paid libraries that parse EML files and extract their
attachments.

Each EML attachment has a MIME type in its metadata, which you can use
to identify the PDF attachments.
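
As a rough illustration of the extract-then-process idea, here is a minimal
sketch. It uses Python's standard-library email module rather than JavaMail,
purely so it runs without extra jars; the same steps (parse the message, walk
its parts, check each part's MIME type, save the bytes) apply in Java. The
function names extract_pdf_attachments and build_sample_eml are hypothetical,
and the sample message is made up for the demonstration. It assumes the PDFs
are labelled application/pdf; some mail clients label them
application/octet-stream instead, in which case you would need to check the
filename as well.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_pdf_attachments(eml_bytes):
    """Return (filename, raw_bytes) pairs for every PDF part in an EML message."""
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    pdfs = []
    for part in msg.walk():
        # Every MIME part reports its content type; keep only the PDFs.
        if part.get_content_type() == "application/pdf":
            name = part.get_filename() or "attachment.pdf"
            # decode=True undoes the base64 transfer encoding.
            pdfs.append((name, part.get_payload(decode=True)))
    return pdfs


def build_sample_eml():
    """Build a small in-memory EML with one fake PDF attachment (demo data only)."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "receiver@example.com"
    msg["Subject"] = "Scanned document"
    msg.set_content("See attached scan.")
    msg.add_attachment(
        b"%PDF-1.4 dummy",
        maintype="application",
        subtype="pdf",
        filename="scan.pdf",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    for filename, data in extract_pdf_attachments(build_sample_eml()):
        print(filename, len(data), "bytes")
```

Once the PDF bytes are written to disk, they can go through the same
Tesseract and Solr pipeline that already works for standalone scanned PDFs.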



On 4/4/2017 2:00 AM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> Currently, I am able to extract scanned PDF images and index them to Solr
> using Tesseract OCR, although the speed is very slow.
>
> However, for EML files with PDF attachments that consist of scanned images,
> the Tesseract OCR is not able to extract the text from those PDF
> attachments.
>
> Can we use the same method for EML files? Or what are the suggestions that
> we can do to extract those attachments?
>
> I'm using Solr 6.5.0
>
> Regards,
> Edwin
>

