lucene-solr-user mailing list archives

From Martin Grotzke <martin.grot...@javakaffee.de>
Subject Re: How to read values of a field efficiently
Date Tue, 31 Jul 2007 10:10:56 GMT
On Mon, 2007-07-30 at 00:30 -0700, Chris Hostetter wrote:
> : Is it possible to get the values from the ValueSource (or from
> : getFieldCacheCounts) sorted by its natural order (from lowest to
> : highest values)?
> 
> well, an inverted term index is already a data structure listing terms
> from lowest to highest and the associated documents -- so if you want to
> iterate from low to high between a range and find matching docs you should
> just use the TermEnum -- the whole point of the FieldCache (and
> FieldCacheSource) is to have a "reverse inverted index" so you can quickly
> fetch the indexed value if you know the docId.
Ok, I will have a look at the TermEnum and try this.

> 
> perhaps you should elaborate a little more on what it is you are trying to
> do so we can help you figure out how to do it more efficiently ...
I want to read all values of the price field of the found docs,
and calculate the mean value and the standard deviation.
Based on the min value (mean - deviation), the max value (mean +
deviation), and the number of prices, I calculate price ranges.
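
A minimal sketch of the mean and standard deviation step described above,
assuming the prices of the found docs are already available as a double[].
The class and method names are hypothetical, and dividing by n (population
standard deviation) is an assumption; the message does not say which
variant is used.

```java
class PriceStats {

    /** Arithmetic mean of the prices; the array must not be empty. */
    static double mean(double[] prices) {
        double sum = 0.0;
        for (double p : prices) {
            sum += p;
        }
        return sum / prices.length;
    }

    /** Population standard deviation (divides by n, which is an assumption). */
    static double stdDev(double[] prices, double mean) {
        double squaredDiffs = 0.0;
        for (double p : prices) {
            double d = p - mean;
            squaredDiffs += d * d;
        }
        return Math.sqrt(squaredDiffs / prices.length);
    }

    public static void main(String[] args) {
        double[] prices = {10.0, 20.0, 30.0, 40.0};
        double m = mean(prices);
        double sd = stdDev(prices, m);
        // Lower and upper bound as described: mean - deviation, mean + deviation.
        System.out.println("min=" + (m - sd) + " max=" + (m + sd));
    }
}
```

Both passes are linear in the number of prices and need no sorted input.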

Then I iterate over the sorted array of prices and count how many
prices go into the current range.

This sorting (Arrays.sort) takes a lot of time, which is why I asked
whether it's possible to read the values in sorted order.

But reading this, I think it would also be possible to skip the sorting:
for each price, determine which bucket it falls into and increment the
counter for that bucket. That should be another way to optimize this.
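
The bucket idea above could look roughly like the following sketch. It
assumes equal-width ranges between a lower and an upper bound and puts
prices outside those bounds into the first or last bucket; the message does
not specify how the ranges are laid out, so both choices are assumptions,
and the names are hypothetical.

```java
class PriceBuckets {

    /**
     * Counts prices per bucket in a single pass, without sorting.
     * Buckets are equal-width slices of [lower, upper) (an assumption);
     * prices outside the bounds are clamped into the first or last bucket.
     */
    static int[] countIntoBuckets(double[] prices, double lower, double upper, int numBuckets) {
        if (numBuckets <= 0 || upper <= lower) {
            throw new IllegalArgumentException("need numBuckets > 0 and upper > lower");
        }
        int[] counts = new int[numBuckets];
        double width = (upper - lower) / numBuckets;
        for (double p : prices) {
            int idx = (int) Math.floor((p - lower) / width);
            if (idx < 0) {
                idx = 0;                  // below the lower bound
            } else if (idx >= numBuckets) {
                idx = numBuckets - 1;     // at or above the upper bound
            }
            counts[idx]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] prices = {5.0, 12.0, 18.0, 25.0, 31.0, 50.0};
        int[] counts = countIntoBuckets(prices, 10.0, 40.0, 3);
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```

This does one pass over the prices, O(n), instead of the O(n log n) sort
followed by a scan.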

> ... perhaps you shouldn't be iterating over every doc to figure out your
> ranges .. perhaps you can iterate over the terms themselves?
Are you referring to TermEnum with this?

Thanx && cheers,
Martin


> 
> 
> hang on ... rereading your first message i just noticed something i
> definitely didn't spot before...
> 
> >> Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
> >> for the second request, while reading prices takes ~600 ms.
> 
> ...i clearly missed this, and fixated on your assertion that your reading
> of field values took longer than the stock methods -- but you're not just
> comparing the time needed by different methods, you're also timing
> different fields.
> 
> this actually makes a lot of sense since there are probably a lot fewer
> unique values for the cat field, so there are a lot fewer discrete values
> to deal with when computing counts.
> 
> 
> 
> 
> -Hoss
> 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/
