lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: OCR - Saving multi-term position
Date Wed, 02 Jul 2014 16:28:40 GMT
Problem here is that you wind up with a zillion unique terms in your
index, which may lead to performance issues, but you probably already
know that :).

I've seen situations where running it through a dictionary helps. That
is, does each term in the OCR match some dictionary? Problem here is
that it then de-values terms that don't happen to be in the
dictionary, names for instance.
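
A minimal sketch of that dictionary check, in Python for brevity rather than as a Lucene filter. The word list, the `dictionary_confidence` helper and the 1.0/0.3 scores are all made up for illustration. Note how the proper noun ends up with the low score even though it was read correctly.

```python
# Hypothetical word list; a real one would be loaded from a dictionary file.
DICTIONARY = {"the", "quick", "brown", "fox"}

def dictionary_confidence(terms, known=1.0, unknown=0.3):
    """Return (term, score) pairs: `known` if the lowercased term is in
    DICTIONARY, otherwise `unknown`. Both scores are arbitrary placeholders."""
    return [(t, known if t.lower() in DICTIONARY else unknown) for t in terms]

# "quiok" is an OCR error; "Erickson" is a correct name the dictionary lacks.
scored = dictionary_confidence(["The", "quiok", "brown", "Erickson"])
```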

But to answer your question: No, there really isn't a pre-built
analysis chain that I know of that does this. Root issue is how to
assign "confidence"? No clue for your specific domain.

So payloads seem quite reasonable here. Happens there's a recent
end-to-end example, see:
http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/
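
For the payload route, the indexed text has to carry the confidence next to each term. A minimal sketch in Python of producing that text, assuming the field is analysed with Solr's DelimitedPayloadTokenFilterFactory using its float encoder and the `|` delimiter. The `to_payload_text` helper and the confidence values are made up for illustration, since how to assign confidence is exactly the open question.

```python
def to_payload_text(scored_terms, delimiter="|"):
    """Join (term, confidence) pairs into 'term|0.90 term|0.10 ...' text.

    Rejects terms that contain the delimiter or whitespace, because they
    would break the term<delimiter>payload syntax.
    """
    parts = []
    for term, confidence in scored_terms:
        if delimiter in term or any(ch.isspace() for ch in term):
            raise ValueError(f"term {term!r} cannot be encoded")
        parts.append(f"{term}{delimiter}{confidence:.2f}")
    return " ".join(parts)

# Hypothetical OCR variants with made-up confidences.
text = to_payload_text([("The", 0.9), ("Tlne", 0.1), ("quick", 0.7), ("quiok", 0.3)])
```

With a whitespace tokenizer each variant still gets its own position, so this carries the confidence but does not by itself stack the variants on one position.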

Best,
Erick

On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta
<michael.della.bitta@appinions.com> wrote:
> I don't have first hand knowledge of how you implement that, but I bet a
> look at the WordDelimiterFilter would help you understand how to emit
> multiple terms with the same positions pretty easily.
>
> I've heard of this "bag of word variants" approach to indexing poor-quality
> OCR output before for findability reasons and I heard it works out OK.
>
> Michael Della Bitta
>
> On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand <
> manuel.lenormand@gmail.com> wrote:
>
>> Hello,
>> Many of our indexed documents are scanned and OCR'ed documents.
>> Unfortunately we were not able to improve the OCR quality much (less than
>> 80% word accuracy) for various reasons, a fact which badly hurts the
>> retrieval quality.
>>
>> As we use an open-source OCR, we are thinking of replacing every scanned term
>> in its output with its most likely variations to get a higher level of
>> confidence.
>>
>> Is there any analyser that supports this kind of need, or should I make up a
>> syntax and analyser of my own, i.e. the payload syntax?
>>
>> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
>>
>> Thanks,
>> Manuel
>>
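
The `term|position` notation in the quoted question maps onto Lucene's position increments: the first variant at a position advances by 1 and every further variant at the same position gets an increment of 0, which is how stacked tokens such as synonyms are represented. A minimal sketch of that conversion in Python; the `to_position_increments` helper is hypothetical and only illustrates what a custom filter would have to compute.

```python
def to_position_increments(annotated):
    """Turn 'The|1 Tlne|1 quick|2' into [(term, position_increment), ...].

    Positions are expected to be non-decreasing. A variant that repeats the
    previous position gets increment 0, i.e. it is stacked on that position.
    """
    tokens = []
    previous = 0  # position of the last emitted token; 0 = before the first
    for item in annotated.split():
        term, sep, pos = item.rpartition("|")
        if not sep or not term:
            raise ValueError(f"expected term|position, got {item!r}")
        position = int(pos)
        if position < previous:
            raise ValueError(f"position goes backwards at {item!r}")
        tokens.append((term, position - previous))
        previous = position
    return tokens

tokens = to_position_increments(
    "The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4"
)
```

Because both variants share a position, a phrase query for "quick brown" would line up with either spelling at each slot.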
