lucene-dev mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: flushRamSegments() is "over merging"?
Date Tue, 15 Aug 2006 15:29:53 GMT
Related to merging more often than one would expect, check out my last
comment in this bug:
http://issues.apache.org/jira/browse/LUCENE-388

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

On 8/15/06, Yonik Seeley <yonik@apache.org> wrote:
> Yes, that's counter-intuitive.... a high merge factor is more likely
> to cause a merge with the last disk-based segment.
>
> On the other hand... if you have a high maxBufferedDocs and a normal
> mergeFactor (much more likely), you could end up with way too many
> segments if you didn't merge.
>
> Hmmm, I'm thinking of another case where you could end up with far too
> many segments... if you have a low merge factor and a high
> maxBufferedDocs (a common scenario), then each small batch of added
> docs will keep creating a separate segment that never gets merged.
>
> Consider the following settings:
> mergeFactor=10
> maxBufferedDocs=10000
>
> Now add 11 docs at a time to an existing index, closing in between.
> segment sizes: 100000, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
> 11, 11, ...
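>
> Something like the following loop (a minimal sketch against the
> 2.0-era IndexWriter API; the index path is made up, and it assumes an
> existing index of ~100000 docs) shows the pattern:
>
>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.document.Field;
>   import org.apache.lucene.index.IndexWriter;
>
>   public class ManySmallSegments {
>     public static void main(String[] args) throws Exception {
>       for (int batch = 0; batch < 20; batch++) {
>         // open the existing index (create=false), add 11 docs, close
>         IndexWriter iw = new IndexWriter("test.index", new StandardAnalyzer(), false);
>         iw.setMergeFactor(10);
>         iw.setMaxBufferedDocs(10000);
>         for (int j = 0; j < 11; j++) {
>           Document doc = new Document();
>           doc.add(new Field("body", "x", Field.Store.YES, Field.Index.UN_TOKENIZED));
>           iw.addDocument(doc);
>         }
>         // 11 + 100000 > mergeFactor, so close() leaves the big disk
>         // segment alone and just flushes another tiny 11-doc segment
>         iw.close();
>       }
>     }
>   }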
>
> It seems like the merge logic somewhere should also take into account
> the number of segments at a certain level, not just the number of
> documents in those segments.
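>
> Purely as an illustration (nothing like this exists in IndexWriter;
> the helper and its names are made up), a trigger keyed on the segment
> count per level might look like:
>
>   // Hypothetical: report that a merge is due once any one level
>   // accumulates mergeFactor segments, where a segment's level is
>   // roughly floor(log_mergeFactor(docCount)).
>   static boolean levelHasTooManySegments(int[] segmentDocCounts, int mergeFactor) {
>     java.util.HashMap counts = new java.util.HashMap(); // level -> segment count
>     for (int i = 0; i < segmentDocCounts.length; i++) {
>       int docs = Math.max(segmentDocCounts[i], 1);
>       Integer level = new Integer((int) (Math.log(docs) / Math.log(mergeFactor)));
>       Integer n = (Integer) counts.get(level);
>       int updated = (n == null) ? 1 : n.intValue() + 1;
>       counts.put(level, new Integer(updated));
>       if (updated >= mergeFactor)
>         return true; // enough same-sized segments at one level
>     }
>     return false;
>   }
>
> In the 100000, 11, 11, ... example above, the 11-doc segments all land
> on the same level, so the tenth one would already trigger a merge.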
>
> -Yonik
>
> On 8/15/06, Doron Cohen <DORONC@il.ibm.com> wrote:
> >
> > Hi, I ran into this while reviewing the patch for 565.
> >
> > It appears that closing an index writer with non-empty ram segments (at
> > least one doc was added) causes a merge with the last (most recent)
> > on-disk segment.
> >
> > This seems problematic to me when an application does a lot of
> > interleaving (adding/removing documents, or even switching indexes) and
> > therefore closes the IndexWriter often.
> >
> > The test case below demonstrates this behavior: maxBufferedDocs,
> > maxMergeDocs, and mergeFactor are all assigned very large values, and in
> > a loop a few documents are added and the IndexWriter is closed and
> > re-opened.
> >
> > Surprisingly (at least for me), the number of segments on disk remains 1.
> > In other words, each time the IndexWriter is closed, the single disk
> > segment is merged with the current ram segments and re-written as a new
> > disk segment.
> >
> > The "blame" is in the second line here:
> >     if (minSegment < 0 ||                   // add one FS segment?
> >         (docCount + segmentInfos.info(minSegment).docCount) > mergeFactor
> > ||
> >         !(segmentInfos.info(segmentInfos.size()-1).dir == ramDirectory))
> >
> > This code in flushRamSegments() merges the (temporary) ram segments with
> > the most recent non-temporary segment.
> >
> > I can see how this can make sense in some cases. Perhaps an additional
> > constraint should be added on the ratio of the size of this non-temp
> > segment to that of all the temporary segments, or on the difference, or
> > both.
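> >
> > Purely hypothetical (RATIO is a made-up knob and this is untested),
> > such a guard could be grafted onto the existing condition along these
> > lines:
> >
> >     int lastDiskDocs = (minSegment < 0) ? 0 : segmentInfos.info(minSegment).docCount;
> >     // skip the disk segment when it dwarfs what is being flushed
> >     boolean diskSegmentMuchBigger = lastDiskDocs > RATIO * docCount;
> >     if (minSegment < 0 ||                   // add one FS segment?
> >         (docCount + lastDiskDocs) > mergeFactor ||
> >         diskSegmentMuchBigger ||
> >         !(segmentInfos.info(segmentInfos.size()-1).dir == ramDirectory))
> >       minSegment++;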
> >
> > Here is the test case,
> > Thanks,
> > Doron
> > ------------------------------------
> > package org.apache.lucene.index;
> >
> > import java.io.IOException;
> >
> > import junit.framework.TestCase;
> >
> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.Field;
> > import org.apache.lucene.store.Directory;
> > import org.apache.lucene.store.FSDirectory;
> > import org.apache.lucene.store.RAMDirectory;
> >
> > /**
> >  * Test that the number of segments is as expected,
> >  * i.e. that there were not too many / too few merges.
> >  *
> >  * @author Doron Cohen
> >  */
> > public class TestNumSegments extends TestCase {
> >
> >       protected int nextDocNum = 0;
> >       protected Directory dir = null;
> >       protected IndexWriter iw = null;
> >       protected IndexReader ir = null;
> >
> >       /* (non-Javadoc)
> >        * @see junit.framework.TestCase#setUp()
> >        */
> >       protected void setUp() throws Exception {
> >             super.setUp();
> >             //dir = new RAMDirectory();
> >             dir = FSDirectory.getDirectory("test.num.segments",true);
> >             iw = new IndexWriter(dir, new StandardAnalyzer(), true);
> >             setLimits(iw);
> >             addSomeDocs(); // some docs in index
> >       }
> >
> >       // for now, take these limits out of the "game"
> >       protected void setLimits(IndexWriter iw) {
> >             iw.setMaxBufferedDocs(Integer.MAX_VALUE-1);
> >             iw.setMaxMergeDocs(Integer.MAX_VALUE-1);
> >             iw.setMergeFactor(Integer.MAX_VALUE-1);
> >       }
> >
> >       /* (non-Javadoc)
> >        * @see junit.framework.TestCase#tearDown()
> >        */
> >       protected void tearDown() throws Exception {
> >             closeW();
> >             if (dir!=null) {
> >                   dir.close();
> >             }
> >             super.tearDown();
> >       }
> >
> >       // count how many segments are in the directory - index writer must be closed
> >       protected int countDirSegments() throws IOException {
> >             assertNull(iw);
> >             SegmentInfos segmentInfos = new SegmentInfos();
> >             segmentInfos.read(dir);
> >             int nSegs = segmentInfos.size();
> >             segmentInfos.clear();
> >             return nSegs;
> >       }
> >
> >       // open writer
> >       private void openW() throws IOException {
> >             iw = new IndexWriter(dir, new StandardAnalyzer(), false);
> >             setLimits(iw);
> >       }
> >
> >       private void closeW() throws IOException {
> >             if (iw!=null) {
> >                   iw.close();
> >                   iw=null;
> >             }
> >       }
> >
> >       public void testNumSegments() throws IOException {
> >             int numExceptions = 0;
> >             for (int i=1; i<30; i++) {
> >                   closeW();
> >                   try {
> >                         assertEquals("Oops - wrong number of segments!", i, countDirSegments());
> >                   } catch (Throwable t) {
> >                         numExceptions++;
> >                         System.err.println(i+":  "+t.getMessage());
> >                   }
> >                   openW();
> >                   addSomeDocs();
> >             }
> >             assertEquals("Oops!, so many times number of segments was \"wrong\"", 0, numExceptions);
> >       }
> >
> >       private void addSomeDocs() throws IOException {
> >             for (int i=0; i<2; i++) {
> >                   iw.addDocument(getDoc());
> >             }
> >       }
> >
> >       protected Document getDoc() {
> >             Document doc = new Document();
> >             doc.add(new Field("body", new Integer(nextDocNum).toString(),
> >                         Field.Store.YES, Field.Index.UN_TOKENIZED));
> >             doc.add(new Field("all", "x", Field.Store.YES, Field.Index.UN_TOKENIZED));
> >             nextDocNum ++;
> >             return doc;
> >       }
> >
> > }
> >
> >
