lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations
Date Tue, 28 Sep 2010 09:32:34 GMT


Michael McCandless commented on LUCENE-2575:

bq. A copy of the byte[][] refs is made when getReader is called.

Hmm why can't the reader just use the current byte[][]?  The writer only adds in new blocks
to this array (doesn't overwrite the already written blocks, until flush)?  (And then allocates
a new byte[][] once that array is full).
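A minimal sketch of that grow-only idea, assuming a single writer thread: the writer only ever appends new byte[] blocks and, when the outer array fills up, publishes a freshly copied byte[][] so old snapshots stay valid. (Class and method names here are illustrative, not the actual patch.)

```java
// Hypothetical grow-only block array.  Published blocks are never
// overwritten before flush, so a reader can keep using whatever
// byte[][] reference it saw at getReader() time.
class GrowOnlyBlockPool {
    private static final int BLOCK_SIZE = 32768;
    private volatile byte[][] blocks = new byte[16][];
    private int blockCount = 0;  // touched only by the single writer thread

    byte[] newBlock() {
        if (blockCount == blocks.length) {
            // Grow by allocating a fresh byte[][]; older snapshots are untouched.
            byte[][] grown = new byte[blocks.length * 2][];
            System.arraycopy(blocks, 0, grown, 0, blocks.length);
            blocks = grown;  // volatile publish of the new outer array
        }
        byte[] b = new byte[BLOCK_SIZE];
        blocks[blockCount++] = b;
        return b;
    }

    // A reader "snapshot" is just the current reference; no copying of
    // the outer array is needed as long as published blocks are immutable.
    byte[][] snapshot() {
        return blocks;
    }
}
```

Note this sketch glosses over the JMM visibility question raised below: storing into an element of the outer array is not itself a volatile write.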

bq. I think the issue at the moment is I'm using a boolean[] to signify if a byte[] needs to be copied before being written to
Hmm so we also copy-on-write a given byte[] block?  Is this because JMM can't make the guarantees
we need about other threads reading the bytes written?
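As I understand the scheme being described, it would look roughly like this: a boolean[] marks which blocks a live reader may still be scanning, and the writer clones such a block before mutating it so the reader's view stays frozen. (A hedged sketch with illustrative names, not the actual patch.)

```java
// Copy-on-write byte[] blocks guarded by a "must copy" flag per block.
class CopyOnWriteBlocks {
    byte[][] blocks;
    boolean[] mustCopy;  // set true for every block when getReader() runs

    CopyOnWriteBlocks(int numBlocks, int blockSize) {
        blocks = new byte[numBlocks][blockSize];
        mustCopy = new boolean[numBlocks];
    }

    // Called at getReader() time: freeze every block for the new reader.
    void markReaderSnapshot() {
        java.util.Arrays.fill(mustCopy, true);
    }

    void write(int block, int offset, byte value) {
        if (mustCopy[block]) {
            blocks[block] = blocks[block].clone();  // reader keeps the old array
            mustCopy[block] = false;                // clone at most once per snapshot
        }
        blocks[block][offset] = value;
    }
}
```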

bq. I have a suspicion we'll change our minds about pooling byte[]s. We may end up implementing ref counting anyways (as described above), and the sudden garbage generated could be a massive change for users?

But even if we do reuse, we will cause tons of garbage, until the still-open readers are closed? Ie we cannot re-use the byte[] being "held open" by any NRT reader that's still referencing the in-RAM segment after that segment had been flushed to disk.

Also the garbage shouldn't be that bad since each object is large.  It's not like 3.x's situation
with FieldCache or terms dict index, for example....
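For reference, the ref-counting alternative mentioned above might be sketched like this: each pooled block carries an atomic ref count, the writer and every NRT reader holding the block incRef() it, and the bytes only return to the free pool when the last holder decRef()s. (Illustrative names and structure, not from the patch.)

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical ref-counted block pool: blocks are recycled instead of
// becoming garbage, but only once no reader still references them.
class BlockPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    final class Block {
        final byte[] bytes;
        private final AtomicInteger refs = new AtomicInteger(1); // writer's ref

        Block(byte[] bytes) { this.bytes = bytes; }

        void incRef() { refs.incrementAndGet(); }

        void decRef() {
            if (refs.decrementAndGet() == 0) {
                synchronized (free) { free.push(bytes); } // recycle, no garbage
            }
        }
    }

    Block allocate(int size) {
        byte[] b;
        synchronized (free) { b = free.poll(); }
        return new Block(b != null && b.length == size ? b : new byte[size]);
    }

    int freeCount() { synchronized (free) { return free.size(); } }
}
```

The cost is exactly the bookkeeping the patch is trying to avoid; dropping reuse entirely (below) sidesteps all of it.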

I would start simple by dropping reuse.  We can then add it back if we see perf issues?

bq. Both very common types of queries, so we probably need some type of skipping, which we will, it'll just be single-level.
I would start simple here, and make skipping stupid, ie just scan. You can get everything working, all tests passing, etc., and then adding in skipping is a much more isolated change. You need all the isolation you can get here! This stuff is *hairy*.
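"Stupid skipping" could be as small as this: advance() just linearly scans the sorted docID array until it reaches the target. Multi-level skip lists can replace the loop later without touching anything else. (An illustrative helper, not Lucene's actual postings API.)

```java
// Scan-based advance: returns the index of the first docID >= target,
// or docIDs.length if the postings list is exhausted.
class ScanDocs {
    static int advance(int[] docIDs, int from, int target) {
        int i = from;
        while (i < docIDs.length && docIDs[i] < target) {
            i++;
        }
        return i;
    }
}
```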

bq. As a side note, there is still an issue in my mind around the term frequencies parallel array (introduced in these patches), in that we'd need to make a copy of it for each reader (because if it changes, the scoring model becomes inaccurate?).

Hmm you're right that each reader needs a private copy, to remain truly "point in time". This (4 bytes per unique term X number of readers reading that term) is a non-trivial addition of RAM.
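Back-of-the-envelope on that cost: 4 bytes per unique term, times the number of readers still open on the in-RAM segment. (The 1M-term / 10-reader figures in the note below are made-up illustration, not measurements.)

```java
// Extra heap consumed by per-reader copies of the int[] term-frequency
// parallel array: one 4-byte int per unique term, per open NRT reader.
class FreqCopyCost {
    static long extraBytes(long uniqueTerms, long openReaders) {
        return uniqueTerms * 4 * openReaders;
    }
}
```

Eg a segment with 1M unique terms and 10 still-open readers would pin an extra ~40 MB of heap for the copies alone.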

BTW I'm assuming IW will now be modal?  Ie caller must tell IW up front if NRT readers will
be used?  Because non-NRT users shouldn't have to pay all this added RAM cost?

> Concurrent byte and int block implementations
> ---------------------------------------------
>                 Key: LUCENE-2575
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
