lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: SOLR Sizing
Date Tue, 04 Oct 2016 14:51:29 GMT
No, we don’t have OCR’ed text. But if you do, it breaks the assumption that vocabulary size is the square root of the text size.

wunder 
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 4, 2016, at 7:14 AM, Rick Leir <rleir@leirtech.com> wrote:
> 
> OCR’ed text can have large amounts of garbage such as '';,-d'." particularly when there is poor image quality or embedded graphics. Is that what is causing your huge vocabularies? I filtered the text, removing any word with fewer than 3 alphanumerics or more than 2 non-alphas.
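The filtering rule Rick describes could be sketched roughly as below. This is a guess at his approach, not his actual code: it interprets "non-alphas" as non-alphanumeric characters and assumes simple whitespace tokenization.

```python
def keep_word(word: str) -> bool:
    """Keep a token only if it has at least 3 alphanumeric characters
    and at most 2 non-alphanumeric characters (Rick's rule, as I read it)."""
    alnum = sum(c.isalnum() for c in word)
    non_alnum = len(word) - alnum
    return alnum >= 3 and non_alnum <= 2

def filter_ocr_text(text: str) -> str:
    # Drop OCR junk tokens like '';,-d'. before indexing
    return " ".join(w for w in text.split() if keep_word(w))

print(filter_ocr_text("quick '';,-d'. brown fo x"))  # -> quick brown
```

Tightening or loosening the two thresholds trades recall of unusual but real words against vocabulary bloat.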
> 
> 
> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>> That approach doesn’t work very well for estimates.
>> 
>> Some parts of the index size and speed scale with the vocabulary instead of the number of documents.
>> Vocabulary usually grows at about the square root of the total amount of text in the index. OCR’ed text breaks that estimate badly, with huge vocabularies.
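The square-root rule of thumb mentioned above can be turned into a one-line estimator. This is only a heuristic sketch: the true growth exponent varies by corpus, and OCR garbage pushes the vocabulary far above this figure.

```python
import math

def estimated_vocabulary(total_text_bytes: int) -> int:
    """Rule of thumb from the thread: unique-term count grows at
    roughly the square root of the total amount of text indexed."""
    return int(math.sqrt(total_text_bytes))

# e.g. 10 GB of reasonably clean text -> on the order of 100k unique terms
print(estimated_vocabulary(10 * 10**9))  # -> 100000
```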
>> 
>> Also, it is common to find non-linear jumps in performance. I’m benchmarking a change in a 12 million document index. It improves the 95th percentile response time for one style of query from 3.8 seconds to 2 milliseconds. I’m testing with a log of 200k queries from a production host, so I’m pretty sure that is accurate.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2777@gmail.com> wrote:
>>> 
>>> In short, if you want your estimate to be closer, run an actual
>>> ingestion for, say, 1-5% of your total docs and extrapolate, since every
>>> search product may have a different schema, different set of fields,
>>> different indexed vs. stored fields, copy fields, different analysis chain, etc.
>>> 
>>> If you want just a very quick rough estimate, create a few flat JSON
>>> sample files (below) with field names and key values (actual data gives a
>>> better estimate). Put in all the field names which you are going to index into
>>> Solr and check the JSON file size. This gives you the average size of a doc;
>>> multiply by the number of docs to get a rough index size.
>>> 
>>> {
>>> "id":"product12345",
>>> "name":"productA",
>>> "category":"xyz",
>>> ...
>>> ...
>>> }
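The rough-estimate approach above can be sketched in a few lines. The sample documents and the document count here are made up for illustration, and raw JSON size is only a proxy: the real index size also depends on analysis, stored vs. indexed fields, and compression.

```python
import json

# Hypothetical sample docs shaped like the JSON example above
samples = [
    {"id": "product12345", "name": "productA", "category": "xyz"},
    {"id": "product12346", "name": "productB", "category": "abc"},
]

# Average serialized size of one sample document, in bytes
avg_doc_bytes = sum(len(json.dumps(d).encode("utf-8")) for d in samples) / len(samples)

total_docs = 50_000_000  # assumed corpus size
rough_index_bytes = avg_doc_bytes * total_docs
print(f"~{rough_index_bytes / 2**30:.1f} GiB raw, before analysis/compression")
```

Using real field values in the samples, as Susheel suggests, matters more than the number of samples: field length variance dominates the error.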
>>> 
>>> Thanks,
>>> Susheel
>>> 
>>> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <tallison@mitre.org>
>>> wrote:
>>> 
>>>> This doesn't answer your question, but Erick Erickson's blog on this topic
>>>> is invaluable:
>>>> 
>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>> 
>>>> -----Original Message-----
>>>> From: Vasu Y [mailto:vyal2k@gmail.com]
>>>> Sent: Monday, October 3, 2016 2:09 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: SOLR Sizing
>>>> 
>>>> Hi,
>>>> I am trying to estimate disk space requirements for the documents indexed
>>>> to SOLR.
>>>> I went through the LucidWorks blog (
>>>> https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
>>>> and am using it as the template. I have a question regarding estimating
>>>> "Avg. Document Size (KB)".
>>>> 
>>>> When calculating disk storage requirements, can we use the Java
>>>> primitive type sizes (
>>>> https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>>>> to come up with an average document size?
>>>> 
>>>> Please let know if the following assumptions are correct.
>>>> 
>>>> Data Type          Size
>>>> -----------------  ------
>>>> long               8 bytes
>>>> tint               4 bytes
>>>> tdate              8 bytes (stored as long?)
>>>> string             1 byte per char for ASCII, 2 bytes per char for non-ASCII (double-byte) chars
>>>> text               1 byte per char for ASCII, 2 bytes per char for non-ASCII (double-byte) chars (for both with & without norms?)
>>>> ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
>>>> boolean            1 bit?
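As a sanity check, per-field raw sizes like the table above can be summed per document, as sketched below. Note the caveat from earlier in the thread: this estimates raw field data only, not the final Lucene index size, which also depends on term dictionaries, norms, doc values, and compression. The field list and lengths are made up, and boolean is rounded up to 1 byte.

```python
# Fixed sizes in bytes for numeric types from the table;
# string/text fields are counted per character instead.
FIXED = {"long": 8, "tint": 4, "tdate": 8, "boolean": 1}

def raw_doc_bytes(fields):
    """fields: list of (field_type, char_length_or_None) pairs."""
    total = 0
    for ftype, length in fields:
        if ftype in FIXED:
            total += FIXED[ftype]
        else:  # string/text: assume 1 byte per ASCII char
            total += length
    return total

# Hypothetical product document: id string, description text, date, price
doc = [("string", 12), ("text", 400), ("tdate", None), ("long", None)]
print(raw_doc_bytes(doc))  # 12 + 400 + 8 + 8 = 428
```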
>>>> 
>>>> Thanks,
>>>> Vasu
>>>> 
>> 
> 

