lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: SortingAtomicReader alternate to Tim-Sort...
Date Tue, 21 Apr 2015 08:00:53 GMT
Thanks for the comments…

My only
> concern about using the FixedBitSet is that it would make sorting each
> postings list run in O(maxDoc) but maybe we can make it better by
> using SparseFixedBitSet


Yes I was also thinking about this. But we are on 4.x and did not take the
plunge. But as you said, it should be a good idea to test on SFBS

I'm curious if you already performed any kind of benchmarking of this
> approach?


Yes we did a stress test of sorts aimed at SortingMergePolicy. We made most
of our data as RAM resident and then CPU hot-spots came up...

There were few take-aways from the test. I am listing down few of them..
It's kind of lengthy. Please read through...

a) Postings-List issue, as discussed above…

b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk.
Our segments are already sorted before participating in a merge. On mostly
linear merge, we ended up decoding the same chunk again and again. Simply
storing the last chunk resulted in good speed-ups for us...

c) Once above steps were corrected, the CPU hotspot shifted to
BlockDocsEnum. Here most of our postings-list < 128 docs. So
Lucene41Postings started using vInts…  I did try ForUtil encoding even for
< 128 docs. It definitely went easy on CPU. But failed to measure resulting
file-size increase.

I realised not just SMP but any other merge must face the same issue and
left it at that..

--
Ravi

On Mon, Apr 20, 2015 at 12:56 PM, Adrien Grand <jpountz@gmail.com> wrote:

> I like these ideas, the int[] we are using today are wasteful. My only
> concern about using the FixedBitSet is that it would make sorting each
> postings list run in O(maxDoc) but maybe we can make it better by
> using SparseFixedBitSet (added in 5.0, given your code snippets I
> assume you are still on 4.x)?
>
> I'm curious if you already performed any kind of benchmarking of this
> approach?
>
>
> On Tue, Apr 14, 2015 at 2:07 PM, Ravikumar Govindarajan
> <ravikumar.govindarajan@gmail.com> wrote:
> > We were experimenting with SortingMergePolicy and came across an
> alternate
> > solution to TimSort of postings-list using FBS & GrowableWriter.
> >
> > I have attached relevant code-snippet. It would be nice if someone can
> > clarify whether it is a good idea to implement...
> >
> > public class SortingAtomicReader {
> > …
> > …
> > class SortingDocsEnum {
> >
> > //Last 2 variables namely *newdoclist* & *olddocToFreq* are added in
> > //constructor. It is assumed that these 2 variables are init during
> > //merge start & they are then re-used till merge completes...
> >
> >
> > public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs,
> > final Sorter.DocMap docMap, FixedBitSet newdoclist, GrowableWriter
> > olddocToFreq) throws IOException {
> >
> > ….
> >
> > …
> >
> > while (true) {
> >
> >   //Instead of Tim-Sorting as in existing code
> >
> >   doc = in.nextDoc();
> >
> >   int newdoc = docMap.oldToNew(doc);
> >
> >   newdoclist.set(newdoc);
> >
> >   if(withFreqs) {
> >
> >     olddocToFreq.set(doc, in.freq());
> >
> >   }
> >
> > }
> >
> >
> > @Override
> >
> > public int nextDoc() throws IOException {
> >
> >   if (++docIt >= upto) {
> >
> >   return NO_MORE_DOCS;
> >
> >   }
> >
> >   currDoc = newdoclist.nextSetBit(++currDoc);
> >
> >   if(currDoc == -1) {
> >
> >     return NO_MORE_DOCS;
> >
> >   }
> >
> >   //clear the set-bit here before returning...
> >
> >   newdoclist.clear(currDoc);
> >
> >   return currDoc;
> >
> > }
> >
> >
> > @Override
> >
> > public int freq() throws IOException {
> >
> >   if(withFreqs && docIt < upto) {
> >
> >   return (int)olddocToFreq.getMutable()
> >
> >                  .get(docMap.newToOld(currDoc));
> >
> >   }
> >
> >   return 1;
> >
> > }
> >
> > }
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message