lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Using Tesseract OCR to extract PDF files in EML file attachment
Date Tue, 04 Apr 2017 06:00:19 GMT
Hi,

Currently, I am able to extract scanned PDF images and index them to Solr
using Tesseract OCR, although the speed is very slow.

However, for EML files whose PDF attachments consist of scanned images,
Tesseract OCR does not extract the text from those PDF attachments.

Can we use the same method for EML files? If not, what would you suggest
for extracting the text from those attachments?
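
One possible workaround, as a minimal sketch only: pull the PDF attachments
out of the EML file first, then send each one through the same extraction
path that already works for standalone scanned PDFs. The sketch below uses
only Python's standard `email` package. The names `extract_pdf_attachments`
and `build_sample_eml` are hypothetical, and it assumes the attachments are
labeled `application/pdf`. It is not a Solr or Tika feature, and it does not
run OCR itself.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def extract_pdf_attachments(eml_bytes):
    """Return a list of (filename, pdf_bytes) for each PDF part in an EML message."""
    msg = BytesParser(policy=policy.default).parsebytes(eml_bytes)
    pdfs = []
    # walk() visits every MIME part, including nested multipart containers.
    for part in msg.walk():
        # Assumption: the sender labeled the attachment as application/pdf.
        if part.get_content_type() == "application/pdf":
            name = part.get_filename() or "attachment.pdf"
            # decode=True reverses the base64 transfer encoding.
            pdfs.append((name, part.get_payload(decode=True)))
    return pdfs


def build_sample_eml():
    """Build a small in-memory EML with one fake PDF attachment (demo only)."""
    msg = EmailMessage()
    msg["Subject"] = "scan"
    msg["From"] = "a@example.com"
    msg["To"] = "b@example.com"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"%PDF-1.4 fake",
        maintype="application",
        subtype="pdf",
        filename="scan.pdf",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    for name, data in extract_pdf_attachments(build_sample_eml()):
        print(name, len(data))
```

Each extracted PDF could then be written to disk and indexed the same way as
a standalone scanned PDF.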

I'm using Solr 6.5.0.

Regards,
Edwin
