lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: SOLR Sizing
Date Thu, 06 Oct 2016 16:12:41 GMT
The square-root rule comes from a short paper draft (unpublished) that I can’t find right
now. But this paper gets the same result:

http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html

Perfect OCR would follow this rule, but even great OCR has lots of errors. 95% accuracy is
good OCR performance, yet that error rate still produces a huge, pathological long tail of
non-language terms, because the misreadings are mostly unique and accumulate with the amount
of text scanned rather than with its square root.
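
If you want a back-of-the-envelope version of the square-root rule for sizing, here is a
minimal sketch in Python (Heaps’ law with beta = 0.5; the constant K is corpus-dependent,
and the 10.0 below is just an illustrative guess, not a measured value):

def estimate_vocabulary(total_tokens, k=10.0, beta=0.5):
    # Heaps' law: V = K * N^beta. beta = 0.5 gives the square-root rule;
    # K varies by corpus, and 10.0 here is an assumption for illustration.
    return int(k * total_tokens ** beta)

for n in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n:,} tokens -> ~{estimate_vocabulary(n):,} distinct terms")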

I learned about the OCR problems from the Hathi Trust. They hit the Solr vocabulary limit
of 2.4 billion terms, and when that limit was raised, they hit memory management issues.

https://www.hathitrust.org/blogs/large-scale-search/too-many-words
https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 6, 2016, at 8:05 AM, Rick Leir <rleir@leirtech.com> wrote:
> 
> I am curious to know where the square-root assumption is from, and why OCR (without errors)
> would break it. TIA
> 
> cheers - - Rick
> 
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption that vocabulary
>> size is the square root of the text size.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir <rleir@leirtech.com> wrote:
>>> 
>>> OCR’ed text can have large amounts of garbage such as '';,-d'." particularly when there
>>> is poor image quality or embedded graphics. Is that what is causing your huge vocabularies?
>>> I filtered the text, removing any word with fewer than 3 alphanumerics or more than 2
>>> non-alphas.
>>> 
>>> 
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>>>> That approach doesn’t work very well for estimates.
>>>> 
>>>> Some parts of the index size and speed scale with the vocabulary instead of the number
>>>> of documents. Vocabulary usually grows at about the square root of the total amount of
>>>> text in the index. OCR’ed text breaks that estimate badly, with huge vocabularies.
>>>> 
>>>> 
> 
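
For anyone who wants to try the kind of filter Rick describes above, here is a minimal
sketch in Python. The thresholds are his; the exact character classes are my assumptions,
not his actual code:

import re

ALNUM = re.compile(r"[0-9A-Za-z]")
NON_ALPHA = re.compile(r"[^A-Za-z]")

def keep_token(token):
    # Keep a token only if it has at least 3 alphanumeric characters
    # and at most 2 non-alphabetic characters, per Rick's description.
    return (len(ALNUM.findall(token)) >= 3
            and len(NON_ALPHA.findall(token)) <= 2)

# OCR junk like '';,-d'." gets dropped; ordinary words survive.
print([t for t in ["'';,-d'.\"", "search", "w0rd", "l1b-rar-y"] if keep_token(t)])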

