lucene-dev mailing list archives

From "Jake Mannix (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Sun, 08 Nov 2009 00:21:32 GMT


Jake Mannix commented on LUCENE-1526:

bq. But, I agree it's wasteful of space when deletes are so
sparse... though it is fast.

It's fast for random access, but it's really slow if you need to make a lot of these (either
during heavy indexing if copy-on-write, or during heavy query load if copy-on-reopen).

bq. So are you using this, only, as your deleted docs? Ie you don't store
the deletions with Lucene? I'm getting confused if this is only for
the NRT case, or, in general.

These only augment the deleted docs *of the disk reader*. The disk reader is reopened only
infrequently: once a batch is ready to be flushed to disk (a big enough RAMDirectory has filled,
or enough time has gone by, depending on configuration), diskReader.addIndexes is called, and
when the diskReader is reopened, the deletes live in the diskReader's normal delete set.
Before that point, while there is a batch in RAM that hasn't been flushed, the
IntSetAccelerator is applied to the not-yet-reopened diskReader.  It's a copy-on-read ThreadLocal.

I'm not sure that described it clearly. Only the deletes which should have been applied to
the diskReader are treated separately, and those are basically batched: for T amount of time
or D amount of docs (configurable), whichever comes first, they are applied to the diskReader,
which knows about Lucene's regular deletions and now these new ones as well.  Once the memory
is flushed to disk, the in-memory delSet is applied to the diskReader using the regular APIs
before reopening, and is then emptied.
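
To make the batching concrete, here is a minimal sketch of the idea in plain Java. It is not Zoie's or Lucene's actual code: the class name BatchedDeletes, its methods, and the use of java.util.BitSet as a stand-in for the disk reader's delete set are all assumptions made for illustration. Time is passed in explicitly so the T-or-D trigger is easy to follow.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Zoie or Lucene.
class BatchedDeletes {
    // Stand-in for the deletes the disk reader already knows about.
    private final BitSet diskDeletes = new BitSet();
    // In-memory delSet: deletes against the disk index that are not yet flushed.
    private final Set<Integer> pendingDeletes = new HashSet<>();
    private final int maxBatchDocs;     // "D": docs per batch
    private final long maxBatchMillis;  // "T": time per batch
    private int docsInBatch = 0;
    private long batchStartMillis;

    BatchedDeletes(int maxBatchDocs, long maxBatchMillis, long nowMillis) {
        this.maxBatchDocs = maxBatchDocs;
        this.maxBatchMillis = maxBatchMillis;
        this.batchStartMillis = nowMillis;
    }

    // Called for every document added to the in-memory batch.
    void recordIndexedDoc() {
        docsInBatch++;
    }

    // A delete against the disk index is only remembered in memory for now.
    void delete(int docId) {
        pendingDeletes.add(docId);
    }

    // Searches see the disk deletes plus the pending in-memory ones.
    boolean isDeleted(int docId) {
        return diskDeletes.get(docId) || pendingDeletes.contains(docId);
    }

    // Flush when D docs have accumulated or T time has passed, whichever comes first.
    boolean shouldFlush(long nowMillis) {
        return docsInBatch >= maxBatchDocs
                || nowMillis - batchStartMillis >= maxBatchMillis;
    }

    // At the batch boundary, apply the pending deletes for real, then empty the set.
    void flush(long nowMillis) {
        for (int docId : pendingDeletes) {
            diskDeletes.set(docId);
        }
        pendingDeletes.clear();
        docsInBatch = 0;
        batchStartMillis = nowMillis;
    }

    int pendingCount() {
        return pendingDeletes.size();
    }
}
```

In this sketch a deleted document stays invisible to searches both before and after the flush; only the place where the delete is recorded changes at the batch boundary.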

bq. OK, I think I'm catching up here... so you only open a new reader at
the batch boundary right? Ie, a batch update (all its adds & deletes)
is atomic from the readers standpoint?

Yes, assuming you mean the disk reader.  It is only reopened at a batch boundary.

bq. OK so a batch is quickly reopened, using bloom filter + int set for
fast "contains" check for the deletions that occurred during that
batch (and, custom TermDocs that does the "and not deleted"). This
gets you your fast turnaround and decent search performance.

The reopening isn't that quick, but it happens in the background. Or are you talking about the
RAMDirectory? Yes, that is reopened per query (only if necessary: if there are no changes, there
is no reopen), but it is kept very small (10k docs or less, for example).  The performance is
actually pretty fantastic - check out the zoie perf pages:

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
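
As a rough illustration of the tombstone idea in the issue description above, the following sketch keeps a shared base bit set and records new deletions in a small sorted int array, so a clone copies only the array instead of the whole bit set. This is not the attached patch: the class TombstonedDeletes, its methods, the use of java.util.BitSet instead of Lucene's BitVector, and the simple count-based merge threshold are all assumptions. A document is treated as deleted here if either structure marks it.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration only; not the LUCENE-1526 patch.
class TombstonedDeletes {
    private final BitSet base;        // shared between clones, never mutated here
    private final int[] tombstones;   // sorted doc ids deleted since base was built
    private final int mergeThreshold; // assumed policy: merge at this many tombstones

    TombstonedDeletes(BitSet base, int[] tombstones, int mergeThreshold) {
        this.base = base;
        this.tombstones = tombstones;
        this.mergeThreshold = mergeThreshold;
    }

    // Deleted if the base bit set or the tombstone list contains the doc.
    boolean isDeleted(int docId) {
        return base.get(docId) || Arrays.binarySearch(tombstones, docId) >= 0;
    }

    // Returns a new instance with docId deleted; the receiver is left unchanged.
    TombstonedDeletes withDelete(int docId) {
        if (isDeleted(docId)) {
            return this;
        }
        // Copy only the small tombstone array, not the whole bit set.
        int[] next = Arrays.copyOf(tombstones, tombstones.length + 1);
        next[tombstones.length] = docId;
        Arrays.sort(next);
        if (next.length >= mergeThreshold) {
            // Too many tombstones: pay for one full copy and fold them in.
            BitSet merged = (BitSet) base.clone();
            for (int id : next) {
                merged.set(id);
            }
            return new TombstonedDeletes(merged, new int[0], mergeThreshold);
        }
        return new TombstonedDeletes(base, next, mergeThreshold);
    }

    int tombstoneCount() {
        return tombstones.length;
    }
}
```

The trade-off is the one the description points at: lookups get slightly slower as tombstones accumulate, which is why some merge policy is needed to bound their number.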
