lucene-solr-user mailing list archives

From Yonik Seeley <yo...@lucidimagination.com>
Subject Re: Questions on FieldValueCache
Date Mon, 03 Aug 2009 20:56:11 GMT
On Mon, Aug 3, 2009 at 4:18 PM, Stephen Duncan Jr <stephen.duncan@gmail.com> wrote:
> On Mon, Aug 3, 2009 at 2:43 PM, Yonik Seeley <yonik@lucidimagination.com> wrote:
> Hmm, that's a hard thing to sell to the user and my boss, as it makes the
> query time go from nearly always being sub-second (frequently less than 60
> ms), to ranging up to nearly 4 seconds for a new query not already in the
> cache.  (My test was with 100 facets being requested, which may be
> reasonable, as one reason to facet on a full-text field is to provide a
> dynamic word cloud).

Could you possibly profile it to find out what the hotspot is?
We don't really have a good algorithm for faceting text fields, but it
would be nice to see what the current bottleneck is.

> How can I mitigate the time it takes with the enum method?  Do I need to ask
> for more facet values in my facet-warming query (I set facet.limit to 1 as
> it didn't seem to matter to the FieldValueCache)?

Yes, it matters for the enum method because of the smart
short-circuiting that takes place.
Use a base query that matches fewer documents than the size of the
sets you want cached.
Set the limit higher to avoid short-circuiting.
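
As a rough illustration of the warming advice above, the sketch below
assembles the request parameters for a facet-warming query that uses the
enum method with a raised limit.  The field name "text", the base query
"id:12345" and the limit of 1000 are hypothetical placeholders, not values
from this thread; only facet, facet.field, facet.method and facet.limit are
actual Solr parameter names.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class FacetWarmingQuerySketch {
    // Builds a URL query string for a warming request (all values are illustrative).
    static String buildParams(String baseQuery, String facetField, int facetLimit) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", baseQuery);              // should match fewer docs than the sets to be cached
        params.put("facet", "true");
        params.put("facet.field", facetField);
        params.put("facet.method", "enum");      // enumerate terms, using the filterCache
        params.put("facet.limit", Integer.toString(facetLimit)); // higher limit to avoid short-circuiting

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical field and query, for illustration only.
        System.out.println(buildParams("id:12345", "text", 1000));
    }
}
```

The resulting string could be used as the parameters of a static warming
query; the right limit and base query depend on the index.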

> And/Or do I need to up the
> autowarmCount on the FilterCache?

Not if you have a static warming query that includes the facets you
are interested in.

> If speed is the primary concern vs
> memory, should I bother with the minDf setting?

minDf is pretty much just for memory savings.  But if you turn it down
or eliminate it, make sure your filterCache is big enough to hold a
filter for each possible term.

> I guess I should update my code to use the enum method on all the fields
> that are likely to risk crossing this line.  Should I be looking at the
> termInstances property on the fields that are displayed in the
> FieldValueCache on the stats page, and figuring those on the order of 10
> million are likely to grow past the limit?

For an index over 16M docs, it's perhaps closer to 16M/avg_bytes_per_term*256.
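
To make that estimate concrete, here is a small arithmetic sketch.  It
assumes "16M" means 16 * 1024 * 1024 bytes (which is what makes the
per-document budget described below come out to exactly 256), and the
class and method names are made up for illustration; this is not code
from UnInvertedField.

```java
final class UnInvertedCapacitySketch {
    static final long ARRAY_COUNT = 256;                      // number of byte arrays
    static final long MAX_ARRAY_BYTES = 16L * 1024 * 1024;    // assumed meaning of "16MB"
    static final long DOCS_PER_BLOCK = 65536;                 // documents sharing one array

    // Average number of bytes available per document within one array.
    static long avgBytesPerDocBudget() {
        return MAX_ARRAY_BYTES / DOCS_PER_BLOCK;
    }

    // Rough upper bound on term instances: 16M / avg_bytes_per_term * 256.
    static long approxMaxTermInstances(double avgBytesPerTerm) {
        return (long) (MAX_ARRAY_BYTES / avgBytesPerTerm * ARRAY_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(avgBytesPerDocBudget());
        System.out.println(approxMaxTermInstances(2.0));
    }
}
```

With an average of 2 bytes per term instance the bound works out to about
2.1 billion; the real average depends on how large the encoded deltas are.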

The storage space for terms that aren't "big terms" (which come from
the fieldCache) is 256 byte arrays, each of which can be up to 16MB in
size.  Every block of 65536 documents shares one of those byte arrays
(or more if you have more than 16M documents).  So the average
document can't take up more than 256 bytes in the array.  That doesn't
mean 256 term instances though... that's the max.  The list is
delta-encoded vints, so if there are many terms, each vint could be bigger.
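
The effect of delta encoding on size can be sketched as follows.  This
uses a generic variable-length integer scheme (7 payload bits per byte,
high bit as a continuation flag) and is only an assumption about the
general idea; the exact byte layout in UnInvertedField may differ, and
the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;

final class DeltaVIntSketch {
    // Encodes a sorted list of term numbers as deltas, each written as a vint.
    static byte[] encode(int[] sortedTermNumbers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int termNumber : sortedTermNumbers) {
            int delta = termNumber - prev;   // gap from the previous term number
            prev = termNumber;
            while ((delta & ~0x7F) != 0) {   // more than 7 bits remain
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);                // final byte, continuation bit clear
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Small gaps need one byte each; large gaps need more.
        System.out.println(encode(new int[] {3, 5, 10}).length);
        System.out.println(encode(new int[] {3, 1000, 200000}).length);
    }
}
```

Three closely spaced term numbers take 3 bytes here, while three widely
spaced ones take 6, which is why a field with many distinct terms can use
up the per-document byte budget with fewer term instances.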

More details in UnInvertedField after the comment:
      //
      // transform intermediate form into the final form, building a single byte[]
      // at a time, and releasing the intermediate byte[]s as we go to avoid
      // increasing the memory footprint.
      //

-Yonik
http://www.lucidimagination.com
