lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: SOLR Sizing
Date Tue, 04 Oct 2016 01:30:15 GMT
That approach doesn’t work very well for estimates.

Some parts of the index size and speed scale with the vocabulary instead of the number
of documents. Vocabulary usually grows at about the square root of the total amount of
text in the index. OCR’ed text breaks that estimate badly, with huge vocabularies.
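
As a rough sketch of that square-root rule of thumb (often called Heaps’ law), with
constants that are illustrative assumptions rather than measured values:

# Heaps'-law-style estimate: unique terms ~= K * (total tokens) ** beta
# K and beta here are illustrative assumptions, not measured values.
def estimated_vocabulary(total_tokens, K=40.0, beta=0.5):
    return int(K * total_tokens ** beta)

# e.g. 12 million docs at ~500 tokens each => ~6e9 tokens => ~3.1M unique terms
print(estimated_vocabulary(12_000_000 * 500))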

Also, it is common to find non-linear jumps in performance. I’m benchmarking a change
in a 12 million document index. It improves the 95th percentile response time for one
style of query from 3.8 seconds to 2 milliseconds. I’m testing with a log of 200k
queries from a production host, so I’m pretty sure that is accurate.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 3, 2016, at 6:02 PM, Susheel Kumar <susheel2777@gmail.com> wrote:
> 
> In short, if you want your estimate to be closer, then run an actual
> ingestion for, say, 1-5% of your total docs and extrapolate, since every
> search product may have a different schema, different set of fields, different
> indexed vs. stored fields, copy fields, different analysis chain, etc.
> 
> If you just want a very quick rough estimate, create a few flat JSON
> sample files (below) with field names and values (actual data for a better
> estimate). Put in all the field names which you are going to index/store in
> Solr and check the JSON file size. This will give you the average size of a
> doc; then multiply by # docs to get a rough index size.
> 
> {
>   "id":"product12345",
>   "name":"productA",
>   "category":"xyz",
>   ...
>   ...
> }
> 
> Thanks,
> Susheel
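
A minimal sketch of that extrapolation (the sample directory and document count below
are placeholders, not values from this thread):

import glob, os

# Average the on-disk size of the sample JSON docs, then multiply by the
# expected number of documents for a very rough total.
samples = glob.glob("samples/*.json")      # placeholder path to the sample files
avg_doc_bytes = sum(os.path.getsize(f) for f in samples) / len(samples)

expected_docs = 10_000_000                 # placeholder total corpus size
rough_bytes = avg_doc_bytes * expected_docs
print("roughly %.1f GiB before Lucene compression/overhead" % (rough_bytes / 2**30))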
> 
> On Mon, Oct 3, 2016 at 3:19 PM, Allison, Timothy B. <tallison@mitre.org>
> wrote:
> 
>> This doesn't answer your question, but Erick Erickson's blog on this topic
>> is invaluable:
>> 
>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>> 
>> -----Original Message-----
>> From: Vasu Y [mailto:vyal2k@gmail.com]
>> Sent: Monday, October 3, 2016 2:09 PM
>> To: solr-user@lucene.apache.org
>> Subject: SOLR Sizing
>> 
>> Hi,
>> I am trying to estimate disk space requirements for the documents indexed
>> to SOLR.
>> I went through the LucidWorks blog
>> (https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
>> and am using it as the template. I have a question regarding estimating
>> "Avg. Document Size (KB)".
>> 
>> When calculating disk storage requirements, can we use the Java types sizing
>> (https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html)
>> to come up with an average document size?
>> 
>> Please let me know if the following assumptions are correct.
>> 
>> Data Type          Size
>> -----------------  ------
>> long               8 bytes
>> tint               4 bytes
>> tdate              8 bytes (stored as long?)
>> string             1 byte per char for ASCII chars, 2 bytes per char for
>>                    non-ASCII (double-byte) chars
>> text               1 byte per char for ASCII chars, 2 bytes per char for
>>                    non-ASCII (double-byte) chars (for both with & without norms?)
>> ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
>> boolean            1 bit?
>> 
>> Thanks,
>> Vasu
>> 
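
For what it’s worth, the back-of-the-envelope arithmetic being asked about here would
look something like the sketch below (field names and character counts are invented for
illustration; as the reply at the top of the thread notes, this kind of sum does not
translate directly into real index size):

# Naive per-document byte estimate from the assumed per-type sizes above.
# Real Lucene/Solr index size also depends on analysis, compression,
# stored vs. indexed fields, norms, doc values, etc.
field_bytes = {
    "id (long)": 8,
    "quantity (tint)": 4,
    "created (tdate)": 8,
    "title (string)": 2 * 32,   # assume ~32 chars, worst case 2 bytes/char
    "body (text)": 1 * 2000,    # assume ~2000 ASCII chars
}
per_doc = sum(field_bytes.values())
print(per_doc, "bytes per doc (naive),",
      per_doc * 5_000_000 / 2**20, "MiB for 5M docs (naive)")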

