lucene-dev mailing list archives

From "Jason Rutherglen (JIRA)" <>
Subject [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
Date Thu, 08 Jan 2009 16:46:59 GMT


Jason Rutherglen commented on LUCENE-1476:

Marvin: "The whole tombstone idea arose out of the need for (close to) realtime search! It's
intended to improve write speed."

It does improve the write speed.  In making realtime search writes fast enough, I found that
writing to individual files per segment can become too costly: the files accumulate quickly,
appending to a single file is faster than creating new files, and deleting the files becomes
costly.  For example, if the number of segments is large and a delete spans multiple segments,
writing small individual files per commit generates many files.  How much this matters depends
on how often updates are expected to occur.  I modeled this after the extreme case: the update
frequency of a MySQL instance backing data for a web application.

The MySQL design, translated to Lucene, is a transaction log per index, where the updates
(documents and deletes) are written to the transaction log file.  If Lucene crashed for some
reason, the transaction log would be replayed.  The in-memory indexes and newly deleted document
bitvectors would be held in RAM (LUCENE-1314) until flushed manually or based on memory usage.
Many users may not want a transaction log, as they may be storing the updates in a separate SQL
database instance (this is the case where I work), so a transaction log would be redundant and
should be optional.  The first implementation of this will not have a transaction log.
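
To make the transaction log idea concrete, here is a minimal sketch, assuming a single
append-only file per index and opaque string payloads.  The names TransactionLog, Entry,
append and replay are hypothetical and are not part of Lucene or of the LUCENE-1476 patch;
a real log would also need checksums and truncation after a flush.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only transaction log for one index (illustration only). */
final class TransactionLog implements AutoCloseable {
    static final byte ADD = 1;     // payload describes an added document
    static final byte DELETE = 2;  // payload describes a delete

    /** One logged update, kept as an opaque payload in this sketch. */
    static final class Entry {
        final byte type;
        final String payload;
        Entry(byte type, String payload) { this.type = type; this.payload = payload; }
    }

    private final FileOutputStream fileOut;
    private final DataOutputStream out;

    TransactionLog(File file) throws IOException {
        // Open in append mode: every update goes to the end of the same file.
        this.fileOut = new FileOutputStream(file, true);
        this.out = new DataOutputStream(fileOut);
    }

    /** Appends one update and forces it to disk before returning. */
    void append(byte type, String payload) throws IOException {
        out.writeByte(type);
        out.writeUTF(payload);   // writeUTF limits a payload to 65535 encoded bytes
        out.flush();
        fileOut.getFD().sync();
    }

    /** Reads back every complete entry, e.g. after a crash. */
    static List<Entry> replay(File file) throws IOException {
        List<Entry> entries = new ArrayList<>();
        if (!file.exists()) return entries;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                try {
                    byte type = in.readByte();
                    entries.add(new Entry(type, in.readUTF()));
                } catch (EOFException e) {
                    break;  // clean end of file, or a torn final record that is skipped
                }
            }
        }
        return entries;
    }

    @Override
    public void close() throws IOException { out.close(); }
}
```

On restart, the caller would apply the entries returned by replay(file) to the in-memory
indexes and deleted-document bitvectors in order.  Making the log optional then amounts to
skipping the append call when updates are already stored elsewhere.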

Marvin: "I don't think I understand. Is this the "combination index reader/writer" model,
where the writer prepares a data structure that then gets handed off to the reader?"

It would be exposed as a combination reader/writer that manages the transaction status of
each update.  Internally, after each update a new reader representing the new documents and
deletes for the transaction is generated and put onto a stack.  The reader stack is drained
based on whether a reader is too old to be useful anymore (i.e. there are no references to
it, or it has N readers ahead of it).
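
The stack-and-drain behaviour described above can be sketched as follows.  This is an
illustration under stated assumptions, not the actual design: SnapshotReader and ReaderStack
are hypothetical names, a reader is reduced to a generation number plus a reference count,
the newest reader is always kept, and "N readers ahead" is read as N retained newer readers.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-transaction reader; stands in for a real index reader. */
final class SnapshotReader {
    final long generation;
    private final AtomicInteger refCount = new AtomicInteger();
    SnapshotReader(long generation) { this.generation = generation; }
    void incRef() { refCount.incrementAndGet(); }
    void decRef() { refCount.decrementAndGet(); }
    int refCount() { return refCount.get(); }
}

/** Hypothetical stack of readers, newest first (illustration only). */
final class ReaderStack {
    private final Deque<SnapshotReader> readers = new ArrayDeque<>();
    private final int maxAhead;   // the "N" from the description above
    private long nextGeneration = 0;

    ReaderStack(int maxAhead) { this.maxAhead = maxAhead; }

    /** Called after each update: push a reader for the new transaction. */
    synchronized SnapshotReader commit() {
        SnapshotReader reader = new SnapshotReader(nextGeneration++);
        readers.addFirst(reader);
        drain();
        return reader;
    }

    /** Hands out the newest reader and counts the reference. */
    synchronized SnapshotReader acquire() {
        SnapshotReader reader = readers.peekFirst();
        if (reader != null) reader.incRef();
        return reader;
    }

    /** Drops one reference and gives the stack a chance to shrink. */
    synchronized void release(SnapshotReader reader) {
        reader.decRef();
        drain();
    }

    synchronized int size() { return readers.size(); }

    /** Removes readers that are unreferenced or have maxAhead newer readers ahead. */
    private void drain() {
        int ahead = 0;  // retained readers newer than the one being examined
        Iterator<SnapshotReader> it = readers.iterator();
        while (it.hasNext()) {
            SnapshotReader reader = it.next();
            boolean newest = (ahead == 0);
            if (!newest && (reader.refCount() == 0 || ahead >= maxAhead)) {
                // The sketch only forgets the reader; real code would also close it
                // once any remaining holders release their references.
                it.remove();
            } else {
                ahead++;
            }
        }
    }
}
```

A search thread would call acquire() before a query and release() afterwards, while the
writer side calls commit() after each update.  Whether a still-referenced reader should
really be dropped once N newer readers exist is a policy choice that the description above
leaves open.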

> BitVector implement DocIdSet
> ----------------------------
>                 Key: LUCENE-1476
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>   Original Estimate: 12h
>  Remaining Estimate: 12h
> BitVector can implement DocIdSet.  This is for making SegmentReader.deletedDocs pluggable.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
