lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5189) Numeric DocValues Updates
Date Tue, 03 Sep 2013 10:32:54 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756494#comment-13756494
] 

Michael McCandless commented on LUCENE-5189:
--------------------------------------------

Patch looks great!  I reviewed some of the remaining nocommits:

{quote}
// nocommit no one calls this method, why do we have it? and if we need it, do we need one
for docValuesGen too?
public void setDelGen(long delGen) {
{quote}

Nuke it!  We only use .advanceNextWriteDelGen (and the patch adds this
for DVs too).

{quote}
// nocommit no one calls this, remove?
void clearDelGen() {
{quote}

Nuke it!

bq. class ReadersAndLiveDocs { // nocommit (RENAME) to ReaderAndUpdates?

+1 for ReaderAndUpdates

{quote}
    // nocommit why do we do that, vs relying on TrackingDir.getCreatedFiles(),
    // like we do for updates?
{quote}

That's a good question ... I'm not sure.  We in fact already use
TrackingDirWrapper (in ReadersAndLiveDocs.writeLiveDocs)... so we
could in theory record those files in SIPC and remove
LiveDocsFormat.files().  Maybe make this a TODO though?
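The tracking-wrapper pattern being discussed is simple: hand the writing code a wrapping Directory, then ask the wrapper afterwards which files were created. A minimal self-contained analogue of the idea (this is not Lucene's actual TrackingDirectoryWrapper API; the `Dir` interface and names here are made up for illustration):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Minimal analogue of the tracking-wrapper idea: intercept file
// creation and remember the names, so callers don't need a separate
// files() method on the format.  Hypothetical interface, not Lucene's.
interface Dir {
  void createFile(String name);
}

class TrackingDir implements Dir {
  private final Dir delegate;
  private final Set<String> created = new LinkedHashSet<>();

  TrackingDir(Dir delegate) { this.delegate = delegate; }

  @Override public void createFile(String name) {
    delegate.createFile(name);
    created.add(name);  // record every file written through us
  }

  Set<String> getCreatedFiles() { return created; }
}

public class TrackingDirDemo {
  public static void main(String[] args) {
    TrackingDir dir = new TrackingDir(name -> {});  // no-op backing dir
    dir.createFile("_0_1.liv");
    dir.createFile("_0_2.dvd");
    System.out.println(dir.getCreatedFiles());  // [_0_1.liv, _0_2.dvd]
  }
}
```

The appeal of recording created files this way is that the writer stays the single source of truth, instead of a parallel `files()` method that can drift out of sync with what was actually written.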

{quote}
// nocommit: review!
final static int BYTES_PER_NUMERIC_UPDATE = BYTES_PER_DEL_TERM + 2*RamUsageEstimator.NUM_BYTES_OBJECT_REF
+ RamUsageEstimator.NUM_BYTES_INT + RamUsageEstimator.NUM_BYTES_LONG;
{quote}

I think it makes sense to start from BYTES_PER_DEL_TERM, but then,
instead of mapping each Term to an Integer value, we map it to a
Map<String,NumericUpdate>, whose per-Term RAM usage is something like:

{noformat}
  PTR (for LinkedHashMap, since it must link each entry to the next?)

  Map
    HEADER
    PTR (to array)
    3 INT
    1 FLOAT

  for each occupied Entry<String,NumericUpdate>
    PTR (from Map's entries array) * 2 (overhead for load factor)
    HEADER
    PTR * 2 (key, value)

    String key
      HEADER
      INT
      PTR (to char[])
      
      ARRAY_HEADER + 2 * length-of-string (field)

    NumericUpdate value
      HEADER
      PTR (to Term; ram already accounted for)
      PTR (to String; ram already accounted for)
      PTR (to Long value) + HEADER + 8 (long)    
      INT
{noformat}
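Spelled out as arithmetic, the per-entry part of that breakdown looks roughly like the sketch below. The constants are assumed values for a 64-bit JVM without compressed oops, standing in for RamUsageEstimator's NUM_BYTES_* constants, and the NumericUpdate field layout is guessed from the breakdown above, so treat the numbers as illustrative only:

```java
// Rough tally of the per-Entry<String,NumericUpdate> breakdown.
// Constants are ASSUMED 64-bit-JVM values; in Lucene they would come
// from RamUsageEstimator (NUM_BYTES_OBJECT_REF, NUM_BYTES_OBJECT_HEADER,
// NUM_BYTES_ARRAY_HEADER, ...).
public class NumericUpdateRamEstimate {
  static final int PTR = 8;            // assumed object reference size
  static final int HEADER = 16;        // assumed object header size
  static final int ARRAY_HEADER = 24;  // assumed array header size
  static final int INT = 4;
  static final int LONG = 8;

  /** Estimated bytes for one occupied Entry&lt;String,NumericUpdate&gt;,
   *  excluding the Term itself (covered by BYTES_PER_DEL_TERM). */
  static long bytesPerEntry(int fieldNameLength) {
    long entry = 2 * PTR            // slot in entries array, *2 for load factor
               + HEADER             // Entry object header
               + 2 * PTR;           // key and value references
    long key = HEADER + INT + PTR   // String: header, hash int, char[] ref
             + ARRAY_HEADER + 2L * fieldNameLength;  // char[] payload
    long value = HEADER
               + PTR                // Term ref (RAM accounted elsewhere)
               + PTR                // field String ref (accounted above)
               + PTR + HEADER + LONG  // boxed Long value
               + INT;               // assumed one int field on NumericUpdate
    return entry + key + value;
  }

  public static void main(String[] args) {
    // e.g. a 10-char field name: fixed 168 bytes + 2*10 for the chars
    System.out.println(bytesPerEntry(10));  // prints 188
  }
}
```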

The thing is, this is so hairy that I think maybe we should instead
use RamUsageEstimator to "calibrate" this?  I.e., make a standalone
test that keeps adding Term + fields into this structure and measures
the RAM with RUE?  Do this on a 32-bit and a 64-bit JVM, and then
conditionalize the constants.  You'll still need to add in bytes
according to field/term lengths...
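A sketch of that calibration idea. To keep it self-contained it measures GC-based heap deltas instead of calling RamUsageEstimator (which a real Lucene test would use), and it populates a plain `Map<String,Long>` as a stand-in for the `Map<String,NumericUpdate>` shape; the entry count and field names are arbitrary:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "calibrate empirically" approach: populate the map
// shape in question with N entries, measure heap growth, divide by N.
// A real Lucene test would use RamUsageEstimator.sizeOf() rather than
// GC-based heap deltas, which are only approximate.
public class RamCalibration {
  static long usedHeap() {
    Runtime rt = Runtime.getRuntime();
    System.gc();  // best effort; GC is only a hint
    return rt.totalMemory() - rt.freeMemory();
  }

  public static void main(String[] args) {
    int n = 200_000;
    long before = usedHeap();
    // Stand-in for Map<String,NumericUpdate>: same key type, boxed value.
    Map<String, Long> updates = new HashMap<>();
    for (int i = 0; i < n; i++) {
      updates.put("field" + i, (long) i);  // arbitrary field names/values
    }
    long after = usedHeap();
    System.out.println("approx bytes/entry: " + (after - before) / n);
  }
}
```

Running this once per JVM architecture (32-bit vs. 64-bit, with and without compressed oops) would give the measured per-entry constants to conditionalize on, as suggested above.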

bq. +public class SegmentInfoPerCommit { // nocommit (RENAME) to SegmentCommit?

Not sure about that rename ... since this class is just the "metadata"
about a commit, not an "actual" commit.  Maybe SegmentCommitInfo?

                
> Numeric DocValues Updates
> -------------------------
>
>                 Key: LUCENE-5189
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5189
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-5189.patch, LUCENE-5189.patch, LUCENE-5189.patch, LUCENE-5189.patch, LUCENE-5189.patch
>
>
> In LUCENE-4258 we started to work on incremental field updates, however the amount of changes is immense and hard to follow/consume. The reason is that we targeted postings, stored fields, DV etc., all from the get-go.
> I'd like to start afresh here, with numeric-dv-field updates only. There are a couple of reasons for that:
> * NumericDV fields should be easier to update, if e.g. we write all the values of all the documents in a segment for the updated field (similar to how livedocs work, and previously norms).
> * It's a fairly contained issue, attempting to handle just one data type to update, yet it requires many changes to core code which will also be useful for updating other data types.
> * It has value in and of itself, and we don't need to allow updating all the data types in Lucene at once ... we can do that gradually.
> I have a working patch already which I'll upload next, explaining the changes.


