lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-7905) Optimizations for OrdinalMap
Date Wed, 12 Jul 2017 19:58:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Michael McCandless updated LUCENE-7905:
---------------------------------------
    Attachment: LUCENE-7905-specialized.patch

I tried specializing the PQ (patch attached) but it was only a small improvement so I don't
plan to pursue it any more.  It could be I should have also specialized the logic to look
for all subs sharing the same term?  But I ran out of time I have to play with this so much
;)

I also ran YourKit and found some silly slow code in {{DefaultSortedSetDocValuesReaderState}},
where it does a linear scan of all ord -> BytesRef to find the ord range of each dimension;
we could do this much more efficiently with binary searching instead.  I'll put a TODO but
don't plan to tackle that now/here.  That was 12% of the time.

Other hot areas were calling {{next()}} on the doc values (~25%) and {{updateTop}} in the
tem queue (~46%).

I do feel like how we compare terms in the PQ is inefficient, and we should be able to do
something like what radix sort does, because at any given time, the terms in the queue likely
share long common prefixes yet we keep inefficiently re-comparing those long common prefixes.
 It seems like there should be a powerful optimization here, where if we could somehow efficiently
know that e.g. right now all terms have a common prefix of length=5, and then only do our
comparisons starting from the 6th digit, or something ... but I don't see how to do it.


> Optimizations for OrdinalMap
> ----------------------------
>
>                 Key: LUCENE-7905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7905
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 7.1
>
>         Attachments: LUCENE-7905.patch, LUCENE-7905-specialized.patch
>
>
> {{OrdinalMap}} is a useful class to quickly map per-segment ordinals to global space,
but it's fairly costly to build, which must typically be done on every NRT refresh.
> I'm using it quite heavily in two different places, one for {{SortedSetDocValuesFacetCounts}},
and another custom usage, and I found some small optimizations to improve its construction
time.
> I switched it to use a simple priority queue to merge the terms instead of the more general
{{MultiTermsEnum}}, which does extra work since it must also provide postings, implement seekExact,
etc.
> I also pulled {{OrdinalMap}} out into its own oal.index class.
> When testing construction time for my case the patch is ~16% faster (159.9s -> 134.2s)
in one case with 91.4 M terms and ~9% faster (115.6s -> 105.7s) in another case with 26.6
M terms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message