lucene-solr-user mailing list archives

From Charlie Hull <char...@flax.co.uk>
Subject Re: Fwd: configuring Solr with Tesseract
Date Mon, 06 Nov 2017 09:05:42 GMT
On 03/11/2017 15:32, Admin eLawJournal wrote:
> Hi,
> I have read that we can use tesseract with solr to index image files. I
> would like some guidance on setting this up.
> 
> Currently, I am using solr for searching my wordpress installation via the
> WPSOLR plugin.
> 
> I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
> wordpress.
> 
> I have also installed tesseract but have no clue on configuring it.
> 
> 
> I am new to solr so will greatly appreciate a detailed step by step
> instruction.

Hi,

I'm guessing if you're using a preconfigured Solr plugin for WP you 
probably haven't got your hands properly dirty with Solr yet.

One way to use Tesseract would be via Apache Tika 
https://wiki.apache.org/tika/TikaOCR which is an awesome library for 
extracting plain text from many different document formats and types. 
There's a direct way to use Tesseract from within Solr (the 
ExtractingRequestHandler 
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika) 
but we don't generally recommend this, as dodgy files can sometimes eat 
all your resources during parsing and if Tika dies then so does Solr. We 
usually process the files externally and then feed them to Solr using its 
HTTP API.

Here's one way to do it - a simple server wrapper around Tika 
https://github.com/mattflax/dropwizard-tika-server written by my 
colleague Matt Pearce.

So you're going to need to do some coding I think - Python would be a 
good choice - to feed your source files to Tika for OCR and extraction, 
and then the resulting text to Solr for indexing.
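
As a rough illustration of that two-step flow, here is a minimal Python 
sketch using only the standard library. It makes several assumptions that 
are not from this thread: that you run the stock Apache Tika server (not 
the dropwizard wrapper linked above, whose endpoints may differ) on its 
default port 9998 with Tesseract installed on the same machine, that your 
Solr core is called "wordpress" (a hypothetical name), and that your 
schema has "id" and "content" fields. Adjust all of these to your setup.

```python
import json
import urllib.request

# Assumed endpoints: a stock Apache Tika server on its default port 9998,
# and a hypothetical Solr core named "wordpress" on the default port 8983.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/wordpress/update?commit=true"


def build_tika_request(file_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika to return the file's plain text."""
    return urllib.request.Request(
        tika_url,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that adds one document to Solr as JSON."""
    # Field names "id" and "content" are assumptions; match your schema.
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path):
    """Extract text from one file via Tika, then index it in Solr."""
    with open(path, "rb") as f:
        tika_req = build_tika_request(f.read())
    # Step 1: Tika (with Tesseract available) returns the extracted text.
    with urllib.request.urlopen(tika_req) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    # Step 2: send the text to Solr, using the file path as the document id.
    with urllib.request.urlopen(build_solr_request(path, text)) as resp:
        return resp.status
```

Calling index_file("scan.png") (a hypothetical file name) would send the 
image to Tika and then post the extracted text to Solr. Because Tika runs 
in its own process here, a file that crashes the parser should not take 
Solr down with it.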

Cheers

Charlie

> 
> Thank you very much
> 


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
