lucene-solr-user mailing list archives

From: Walter Underwood <>
Subject: Re: SOLR Sizing
Date: Thu, 06 Oct 2016 16:12:41 GMT
The square-root rule comes from a short paper draft (unpublished) that I can’t find right
now. But this paper gets the same result: <>

Perfect OCR would follow this rule, but even great OCR has lots of errors. 95% accuracy is
good OCR performance, but that makes a huge, pathological long tail of non-language terms.
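To put the square-root rule in back-of-envelope form: if vocabulary grows as roughly K * sqrt(total tokens), you can estimate distinct-term counts for a planned index size. The sketch below is illustrative only (not from the thread); the constant K is an assumption, and real corpora, especially OCR'ed ones, can diverge widely.

import math

# Rough distinct-term estimate under the square-root rule:
# vocabulary ~ K * sqrt(total tokens). K is an assumed constant,
# not a measured value; tune it against a sample of your own corpus.
def estimated_vocabulary(total_tokens, k=10.0):
    return int(k * math.sqrt(total_tokens))

if __name__ == "__main__":
    for tokens in (1_000_000, 100_000_000, 10_000_000_000):
        print(f"{tokens:,} tokens -> ~{estimated_vocabulary(tokens):,} distinct terms")

With clean text the estimate stays manageable; per the point above, OCR errors add a long tail of near-unique garbage tokens, so the real vocabulary can be far larger than this predicts.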

I learned about the OCR problems from the Hathi Trust. They hit the Solr vocabulary limit
of 2.4 billion terms, then when that was raised, they hit memory management issues. <> <>

Walter Underwood  (my blog)

> On Oct 6, 2016, at 8:05 AM, Rick Leir <> wrote:
> I am curious to know where the square-root assumption is from, and why OCR (without errors) would break it. TIA
> cheers - - Rick
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption that vocabulary
>> is the square root of the text size.
>> wunder
>> Walter Underwood
>>  (my blog)
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir <> wrote:
>>> OCR’ed text can have large amounts of garbage such as '';,-d'." particularly when there is poor image quality or embedded graphics. Is that what is causing your huge vocabularies?
>>> I filtered the text, removing any word with fewer than 3 alphanumerics or more than 2 non-alphas (see the sketch after the quoted thread).
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>>>> That approach doesn’t work very well for estimates.
>>>> Some parts of the index size and speed scale with the vocabulary instead of the number of documents.
>>>> Vocabulary usually grows at about the square root of the total amount of text in the index. OCR’ed text breaks that estimate badly, with huge vocabularies.
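Rick's heuristic quoted above, dropping any token with fewer than 3 alphanumeric characters or more than 2 non-alphanumeric characters, can be applied as a pre-indexing cleanup step. A minimal sketch of that idea follows; it is not code from the thread, and the exact character classes and thresholds are assumptions.

# Sketch of an OCR-garbage filter along the lines Rick describes:
# keep a token only if it has at least 3 alphanumeric characters
# and at most 2 non-alphanumeric characters.
def keep_token(token):
    alnum = sum(c.isalnum() for c in token)
    return alnum >= 3 and (len(token) - alnum) <= 2

def filter_ocr_garbage(tokens):
    return [t for t in tokens if keep_token(t)]

if __name__ == "__main__":
    sample = ["'';,-d'.", "quick", "br0wn", "x1", "w0rd--,", "jumps!"]
    print(filter_ocr_garbage(sample))  # ['quick', 'br0wn', 'jumps!']

A filter like this trims the pathological long tail of non-language terms before it ever reaches the index, which is exactly the vocabulary growth the square-root estimate cannot absorb.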
