lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrien Grand <jpou...@gmail.com>
Subject Re: Better DocSetCollector
Date Tue, 11 Aug 2015 09:57:50 GMT
The go-to class for building a DocIdSet from a sorted iterator in
Lucene would rather be RoaringDocIdSet. BitDocIdSet works too, but it
has to trade quite some iteration/build speed and memory efficiency in
order to provide random write and read access. It also looks to me
that roaring bitmaps would be a good compromise for the ideas that
Toke and Yonik are having, as it allocates chunks of memory and
happily deals with holes in the doc ID space?

On Tue, Aug 11, 2015 at 11:25 AM, Ramkumar R. Aiyengar
<andyetitmoves@gmail.com> wrote:
> I wonder if there might be value in BitDocIdSet.Builder which Lucene uses.
> It had perf issues of its soon, but LUCENE-6645 seems to have fixed them,
> and it does a similar approach as above (int array and then fixedbitset).
>
> On 3 Aug 2015 12:35, "Toke Eskildsen" <te@statsbiblioteket.dk> wrote:
>>
>> On Sat, 2015-08-01 at 15:09 -0700, Yonik Seeley wrote:
>> > I also investigated going the other way and tracking a List<int[]> and
>> > allocating in smaller chunks (and even having a memory pool to pull
>> > the fixed size chunks from) but it was slower on my first attempt and
>> > I haven't returned to try more variants yet.  It *feels* like we
>> > should be able to get overall speedups by allocating in 8K chunks or
>> > so when the effects of memory bandwidth (the cost of zeroing) and GC
>> > are considered.
>>
>> Chunked allocations of int[] would still have the problem of having the
>> copy-to-bitmap step if the result set gets too big.
>>
>> Chunks might work better with the garbage collector, compared to the
>> current solution, but I greatly prefer the idea of re-using structures.
>>
>> That being said, I realize that it is not simple to choose the proper
>> strategy:
>>
>> http://stackoverflow.com/questions/1955322/at-what-point-is-it-worth-reusing-arrays-in-java
>>
>> In the case of an update-tracked structure, the cost of zeroing is
>> linear to the amount of changed values. This makes it even harder to
>> determine the best strategy as it will be tied to concrete index size
>> and query pattern.
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>



-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message