lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
Date Thu, 08 Jan 2009 19:12:59 GMT


Marvin Humphrey commented on LUCENE-1476:

Jason Rutherglen:

> I found in making the realtime search write speed fast enough that writing
> to individual files per segment can become too costly (they accumulate fast,
> appending to a single file is faster than creating new files, deleting the
> files becomes costly). 

I saw you mentioning i/o overhead on Windows in particular.  I can't see a way
to mod Lucene so that it doesn't generate a bunch of files for each commit,
and FWIW Lucy/KS is going to generate even more files than Lucene.

Half-seriously... how about writing a single-file Directory implementation?

> For example, writing to small individual files per commit, if the number of
> segments is large and the delete spans multiple segments will generate many
> files. 

There would be a maximum of two files per segment to hold the tombstones: one
to hold the tombstone rows, and one to map segment identifiers to tombstone
rows.  (In Lucy/KS, the mappings would probably be stored in the JSON-encoded
"segmeta" file, which stores human-readable metadata on behalf of multiple
components.)
Segments containing tombstones would be merged according to whatever merge
policy was in place.  So there won't ever be an obscene number of tombstone
files unless you allow an obscene number of segments to accumulate.
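To make the bookkeeping concrete, here is a minimal sketch of the two-files-per-segment layout described above. The class and method names are illustrative only, not Lucene or Lucy/KS API; the point is that there is at most one mapping entry per target segment, so the file count is bounded by the number of live segments rather than the number of deletes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the tombstone layout: one file would hold the
 *  tombstone rows, and a companion file would map each target segment's
 *  identifier to the offset of its row.  Names are illustrative. */
class TombstoneFiles {
    // segment identifier -> offset of its tombstone row in the rows file
    private final Map<String, Long> rowOffsets = new HashMap<>();

    /** Record (or overwrite) the tombstone row for a target segment. */
    void addRow(String targetSegment, long offset) {
        rowOffsets.put(targetSegment, offset);
    }

    /** At most one row per target segment, so the mapping stays bounded
     *  by the number of live segments, not the number of deletions. */
    int rowCount() {
        return rowOffsets.size();
    }
}
```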

> Many users may not want a transaction log as they may be storing the updates
> in a separate SQL database instance (this is the case where I work) and so a
> transaction log is redundant and should be optional. 

I can see how this would be quite useful at the application level.  However, I
think it might be challenging to generalize the transaction log concept at the
library level:

CustomAnalyzer analyzer = new CustomAnalyzer();
IndexWriter indexWriter = new IndexWriter("/path/to/index", analyzer, true);
analyzer.setFoo(2); // change of analyzer state not recorded by transaction log

MySQL is more of a closed system than Lucene, which I think makes options
available that aren't available to us.

> The reader stack is drained based on whether a reader is too old to be
> useful anymore (i.e. it has no references to it, or it has N readers
> ahead of it).

Right, this is the kind of thing that Lucene has to do because of the
single-reader model, and that we're trying to get away from in Lucy/KS by
exploiting mmap and making IndexReaders cheap wrappers around the system i/o
cache.
I don't think I can offer any alternative design suggestions that meet your
needs.   There's going to be a change rate that overwhelms the multi-file
commit system, and it seems that you've determined you're up against it.  
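For illustration, the drain policy quoted above might look something like the sketch below. This is not Lucene API; the `Entry`, `maxAhead`, and `push` names are assumptions made up for this example, standing in for whatever the real reader pool uses. A reader is retired once it is unreferenced or once more than N newer readers have been opened ahead of it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/** Illustrative sketch (not Lucene API) of draining a reader stack:
 *  retire a reader when it has no external references, or when more
 *  than maxAhead newer readers have been opened ahead of it. */
class ReaderStack {
    static final class Entry {
        int refCount;
        Entry(int refs) { this.refCount = refs; }
    }

    private final Deque<Entry> stack = new ArrayDeque<>(); // newest first
    private final int maxAhead;

    ReaderStack(int maxAhead) { this.maxAhead = maxAhead; }

    /** Push a newly opened reader, then drain any readers too old to keep. */
    void push(Entry e) {
        stack.addFirst(e);
        drain();
    }

    private void drain() {
        int ahead = 0; // number of newer readers seen so far
        for (Iterator<Entry> it = stack.iterator(); it.hasNext(); ) {
            Entry e = it.next();
            if (e.refCount == 0 || ahead > maxAhead) {
                it.remove(); // too old to be useful anymore
            }
            ahead++;
        }
    }

    int size() { return stack.size(); }
}
```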

What's killing us is something different: not absolute change rate, but poor 
worst-case performance.

FWIW, we contemplated a multi-index system with an index on a RAM disk for
fast changes and a primary index on the main file system.  It would have
worked fine for pure adds, but it was very tricky to manage state for
documents which were being "updated", i.e.  deleted and re-added.  How are you
handling all these small adds with your combo reader/writer?  Do you not have
that problem?

> BitVector implement DocIdSet
> ----------------------------
>                 Key: LUCENE-1476
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>   Original Estimate: 12h
>  Remaining Estimate: 12h
> BitVector can implement DocIdSet.  This is for making SegmentReader.deletedDocs pluggable.
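As a standalone sketch of the idea in this issue: exposing a bit vector through a DocIdSet-style iterator is what makes deletedDocs pluggable. The real patch would implement Lucene's own DocIdSet/DocIdSetIterator classes; the minimal interface below is a stand-in for illustration, built on java.util.BitSet.

```java
import java.util.BitSet;

/** Stand-in sketch for the LUCENE-1476 idea: iterate a bit vector's
 *  set bits as doc ids.  The real patch would implement Lucene's
 *  DocIdSet/DocIdSetIterator; this interface is illustrative only. */
class BitVectorDocIdSet {
    private final BitSet bits;

    BitVectorDocIdSet(BitSet bits) { this.bits = bits; }

    interface DocIdIterator {
        /** Returns the next set doc id, or -1 when exhausted. */
        int nextDoc();
    }

    DocIdIterator iterator() {
        return new DocIdIterator() {
            private int doc = -1;
            public int nextDoc() {
                doc = bits.nextSetBit(doc + 1);
                return doc; // nextSetBit returns -1 past the last set bit
            }
        };
    }
}
```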

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
