lucene-java-user mailing list archives

From András Péteri <apet...@b2international.com>
Subject Re: Mapping doc values back to doc ID (in decent time)
Date Sun, 09 Aug 2015 15:51:54 GMT
If I understand it correctly, the Zoie library [1][2] implements the
"sledgehammer" approach by collecting docValues for all documents when a
segment reader is opened. If you have some RAM to throw at the problem,
this could indeed bring you an acceptable level of performance.

[1] http://senseidb.github.io/zoie/
[2] https://github.com/senseidb/zoie/blob/master/zoie-core/src/main/java/proj/zoie/api/impl/DocIDMapperImpl.java
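
To make the "sledgehammer" idea concrete, here is a rough sketch in plain
Java with no Lucene dependency. The class and method names are hypothetical,
and the long[] stands in for the per-document ID values one would read from
doc values once, when a segment reader is opened. I have not checked that
Zoie's DocIDMapperImpl is laid out this way; this only shows the general
shape of trading RAM for lookup time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-segment cache: application ID -> segment-local doc ID. */
final class SegmentIdCache {
    private final Map<Long, Integer> idToDoc;

    /**
     * idsByDoc[doc] is the application ID stored for that segment-local doc.
     * Assumption: in a real index this array would be filled by reading the
     * ID doc values once, at the time the segment reader is opened.
     */
    SegmentIdCache(long[] idsByDoc) {
        idToDoc = new HashMap<>(Math.max(16, idsByDoc.length * 2));
        for (int doc = 0; doc < idsByDoc.length; doc++) {
            idToDoc.put(idsByDoc[doc], doc);
        }
    }

    /** Returns the segment-local doc ID, or -1 if the ID is not in this segment. */
    int docFor(long id) {
        Integer doc = idToDoc.get(id);
        return doc == null ? -1 : doc;
    }
}
```

The sketch ignores deleted documents and boxes every entry, so a real
implementation would probably want a primitive-keyed map to keep the memory
cost down.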

On Sun, Aug 9, 2015 at 9:41 AM, Trejkaz <trejkaz@trypticon.org> wrote:

> On Fri, Aug 7, 2015 at 5:34 PM, Adrien Grand <jpountz@gmail.com> wrote:
> > Does your application actually iterate in order over dense ids, or is
> > it just for benchmarking purposes? Because if it does, you probably
> > don't actually need seeking, you could just see what the current ID in
> > the terms enum is.
>
> Both dense ID fetches and individual ID fetches exist in the
> application. I put them in a benchmark deliberately doing it as
> individual fetches to get an idea of average timing for a single
> operation.
>
> There are so many use cases of doing the individual fetches that it's
> tough to enumerate. The first one I found was "fetch the term vector
> for ID + field" but I'm sure there will be tons of them.
>
> For mapping a dense set of IDs to doc IDs (e.g. for filtering), I
> would probably use something like DocValuesTermsQuery for that to get
> them all in one shot. I also wondered whether writing our filters as
> queries would help, but I think it would turn out to be about as fast
> as DocValuesTermsQuery even if I did that.
>
> I'm sure the only way to really improve the speed of these filters is
> to start storing these things in the text index and use query-time
> joins, but I can't do that until I solve the issue of relying on
> stable doc IDs and it seems like trying to solve two large problems in
> a single commit would be biting off more than I can chew.
>
> > If you actually need seeking, then you should try
> > to avoid MultiFields, it will call seekExact on each segment, while
> > given what I see you could just stop after you found one segment with
> > the value.
>
> Ah, I did wonder whether MultiFields had any behaviour like that, so
> that definitely means that I will avoid using it. Then I can try other
> tricks, like trying the seeks in order of segment size (the largest
> segment is most likely to contain the hit.)
>
> TX
>
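
The largest-segment-first trick quoted above could look roughly like the
sketch below. It is plain Java rather than Lucene code: Segment is a
hypothetical stand-in for a leaf reader, its map stands in for whatever
per-segment lookup is used (a terms enum seek, or a cache like the one
sketched earlier in this message), and docBase plays the role of the
segment's offset in the global doc ID space. It assumes each ID lives in at
most one segment, which is what makes stopping at the first hit safe.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class LargestFirstLookup {
    /** Hypothetical stand-in for one segment of the index. */
    static final class Segment {
        final int docBase;             // offset of this segment in the global doc ID space
        final Map<Long, Integer> ids;  // application ID -> segment-local doc ID

        Segment(int docBase, Map<Long, Integer> ids) {
            this.docBase = docBase;
            this.ids = ids;
        }

        int size() {
            return ids.size();
        }
    }

    private final List<Segment> bySizeDesc;

    LargestFirstLookup(List<Segment> segments) {
        // Copy, then sort so the biggest segment is probed first.
        bySizeDesc = new ArrayList<>(segments);
        bySizeDesc.sort(Comparator.comparingInt(Segment::size).reversed());
    }

    /** Returns the global doc ID for the given ID, or -1 if no segment has it. */
    int globalDocFor(long id) {
        for (Segment segment : bySizeDesc) {
            Integer local = segment.ids.get(id);
            if (local != null) {
                return segment.docBase + local; // first hit wins, skip the rest
            }
        }
        return -1;
    }
}
```

The sort happens once per reader reopen rather than once per lookup, so the
per-lookup cost is just the probes themselves.
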
-- 
András
