lucene-solr-user mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: Lucene FieldCache memory requirements
Date Mon, 02 Nov 2009 23:44:16 GMT

Thank you very much Mike,

I found it:
org.apache.solr.request.SimpleFacets
...
        // TODO: future logic could use filters instead of the fieldcache if
        // the number of terms in the field is small enough.
        counts = getFieldCacheCounts(searcher, base, field, offset, limit,
                                     mincount, missing, sort, prefix);
...
    FieldCache.StringIndex si =
        FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
    final String[] terms = si.lookup;
    final int[] termNum = si.order;
...
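
A rough sketch of how I read the lookup/order pair above: "lookup" holds each unique value once, and "order" holds one int per document pointing into "lookup". This is a toy stand-in, not the Lucene class; ToyStringIndex, its constructor, and facetCounts are hypothetical names, and reserving slot 0 for documents without a value is my assumption for this sketch.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Toy stand-in for a lookup/order pair; it does not call Lucene.
final class ToyStringIndex {
    final String[] lookup; // unique values in sorted order; slot 0 means "no value"
    final int[] order;     // order[docId] = position of that doc's value in lookup

    ToyStringIndex(String[] valuePerDoc) {
        // Collect the distinct values in sorted order.
        TreeSet<String> unique = new TreeSet<>();
        for (String v : valuePerDoc) {
            if (v != null) {
                unique.add(v);
            }
        }
        lookup = new String[unique.size() + 1];
        int next = 1;
        for (String v : unique) {
            lookup[next++] = v;
        }
        // One int per document, regardless of how long the strings are.
        order = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            String v = valuePerDoc[doc];
            order[doc] = (v == null)
                    ? 0
                    : Arrays.binarySearch(lookup, 1, lookup.length, v);
        }
    }

    // Facet counts per unique value: a single pass over the int array.
    int[] facetCounts() {
        int[] counts = new int[lookup.length];
        for (int ord : order) {
            counts[ord]++;
        }
        return counts;
    }
}
```

With this layout the per-document cost is the int in "order"; the strings themselves are stored once per unique value.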


So a 64-bit JVM requires more memory :)


Mike, am I right here?
[(8-byte pointer) + (4-byte DocID)] x [number of documents (100 million)]
(64-bit JVM)
= 1.2 GB RAM for this...

Or maybe I am wrong:
> For Lucene directly, simple strings would consume a pointer (4 or 8
> bytes depending on whether your JRE is 64bit) per doc, and the string
> index would consume an int (4 bytes) per doc.

[8 bytes (64-bit)] x [number of documents (100 million)]?
= 0.8 GB

That would be a kind of Map between String and DocSet, saving 4 bytes: the
"key" is the String, and the "value" is an array of 64-bit pointers to
documents. But why 64-bit (on a 64-bit JVM)? I always thought a document is
referenced by an (int) documentId...

Am I right?
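
To put numbers on the readings above, here is a back-of-the-envelope sketch. It assumes Mike's note describes two separate caches (getStrings: one pointer per doc; getStringIndex: one int per doc plus one pointer per unique value), and it ignores array headers and the String objects themselves. FieldCacheEstimate and its methods are hypothetical names; nothing here calls Lucene.

```java
// Rough arithmetic only; class and method names are made up for illustration.
final class FieldCacheEstimate {
    // getStrings-style cache: one object reference per document.
    static long stringsBytes(long numDocs, int pointerBytes) {
        return numDocs * pointerBytes;
    }

    // getStringIndex-style cache: one 4-byte int per document, plus one
    // reference per unique value (and one extra slot for "no value").
    static long stringIndexBytes(long numDocs, long uniqueValues, int pointerBytes) {
        return numDocs * 4L + (uniqueValues + 1) * pointerBytes;
    }

    public static void main(String[] args) {
        long docs = 100_000_000L;
        int pointer = 8; // 64-bit JVM, as assumed in this message

        System.out.println(stringsBytes(docs, pointer));         // 800000000
        System.out.println(stringIndexBytes(docs, 10, pointer)); // 400000088
        // The 1.2 GB figure is what you get by adding pointer + int per doc:
        System.out.println(stringsBytes(docs, pointer) + docs * 4L); // 1200000000
    }
}
```

Under these assumptions the 1.2 GB figure applies only if both the pointer and the int are paid for every document; the string index alone comes to about 0.4 GB.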


Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!

>> Note that for your use case, this is exceptionally wasteful.  
This is probably a very common case... I think it should be confirmed by
Lucene developers too... FieldCache is warmed anyway, even when we don't use
SOLR...
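
To illustrate why a plain int per document is wasteful with only 10 unique values: 11 distinct ordinals (10 values plus "no value") fit in 4 bits, so 100 million docs would need about 50 MB instead of 400 MB. The sketch below is a generic bit-packing example under that assumption; PackedOrds is a hypothetical class and is not the design from LUCENE-1990, which I have not checked.

```java
// Generic bit-packed array of small non-negative values; not Lucene code.
final class PackedOrds {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedOrds(int size, int bitsPerValue) {
        // Only widths that divide 64, so a value never straddles two longs.
        if (bitsPerValue <= 0 || bitsPerValue > 64 || 64 % bitsPerValue != 0) {
            throw new IllegalArgumentException("bitsPerValue must divide 64");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (bitsPerValue == 64) ? -1L : (1L << bitsPerValue) - 1;
        long totalBits = (long) size * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);  // which long holds the value
        int shift = (int) (bitPos & 63);   // offset inside that long
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        return (blocks[(int) (bitPos >>> 6)] >>> (int) (bitPos & 63)) & mask;
    }

    long ramBytes() {
        return blocks.length * 8L;
    }

    // Smallest power-of-two width able to hold values 0..maxValue.
    static int bitsRequired(long maxValue) {
        int bits = 64 - Long.numberOfLeadingZeros(Math.max(1L, maxValue));
        int pow2 = Integer.highestOneBit(bits);
        return (pow2 == bits) ? bits : pow2 << 1;
    }
}
```

For example, new PackedOrds(numDocs, PackedOrds.bitsRequired(10)) would store ordinals 0..10 at 4 bits each.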

 
-Fuad







> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: November-02-09 6:00 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Lucene FieldCache memory requirements
> 
> OK I think someone who knows how Solr uses the fieldCache for this
> type of field will have to pipe up.
> 
> For Lucene directly, simple strings would consume a pointer (4 or 8
> bytes depending on whether your JRE is 64bit) per doc, and the string
> index would consume an int (4 bytes) per doc.  (Each also consumes
> negligible (for your case) memory to hold the actual string values).
> 
> Note that for your use case, this is exceptionally wasteful.  If
> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
> then it'd take much fewer bits to reference the values, since you have
> only 10 unique string values.
> 
> Mike
> 
> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi <fuad@efendi.ca> wrote:
> > I am not using Lucene API directly; I am using SOLR which uses Lucene
> > FieldCache for faceting on non-tokenized fields...
> > I think this cache will be lazily loaded, until user executes sorted (by
> > this field) SOLR query for all documents *:* - in this case it will be
> > fully populated...
> >
> >
> >> Subject: Re: Lucene FieldCache memory requirements
> >>
> >> Which FieldCache API are you using?  getStrings?  or getStringIndex
> >> (which is used, under the hood, if you sort by this field).
> >>
> >> Mike
> >>
> >> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi <fuad@efendi.ca> wrote:
> >> > Any thoughts regarding the subject? I hope FieldCache doesn't use more
> >> > than 6 bytes per document-field instance... I am too lazy to research
> >> > Lucene source code, I hope someone can provide exact answer... Thanks
> >> >
> >> >
> >> >> Subject: Lucene FieldCache memory requirements
> >> >>
> >> >> Hi,
> >> >>
> >> >>
> >> >> Can anyone confirm Lucene FieldCache memory requirements? I have 100
> >> >> millions docs with non-tokenized field "country" (10 different
> >> >> countries); I expect it requires array of ("int", "long"), size of
> >> >> array 100,000,000, without any impact of "country" field length;
> >> >>
> >> >> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
> >> >> document ID),  and "long" is pointer to String value...
> >> >>
> >> >> Am I right, is it 600Mb just for this "country" (indexed,
> >> >> non-tokenized, non-boolean) field and 100 millions docs? I need to
> >> >> calculate exact minimum RAM requirements...
> >> >>
> >> >> I believe it shouldn't depend on cardinality (distribution) of field...
> >> >>
> >> >> Thanks,
> >> >> Fuad
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> >
> >
> >
> >


