lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikhail Khludnev <mkhlud...@griddynamics.com>
Subject Re: yet another one (mad) approach for join
Date Tue, 29 Apr 2014 18:59:24 GMT
Let me challenge it another way.

If we look at Lucene JoinUtil, it works in two phases:
- looping "from" side docs, getting a term by docNum via
DocValues/FieldCache
- searching those term texts in "to" side field;

intermediate term texts seems like a really inconvenient thing.
What if we have a numeric docValue field contains parent's docNum. Thus, we
need just loop "from" side docs, getting parent docNum and drop them into a
bitset. It seems like a low hanging fruit, and it can get over JoinUtil
already.

the miserable question is how to index a parent docnum into a docvalue
field. So, far we can do that with updatable DocValues. Thanks to Shai!¡!

I need to know what you think so, far.

But coming back to block-join, it's noticeable that it has uber-performance
hint - it can leap-frop from parent to child and vice-versa. Hence if the
join query above is intersected with a highly selective "to"-side filter -
we waste resources to calculate whole "to"-side bitset. Sad.

Let's go deeper, limiting by the single value join case, aka 1:N (to:from).
Let's write a sequence of related docNums into binary DocValues field. How
to index it? - we have Binary DocValues updates! Thanks to Shai!¡!

That's what I have, so far: I can loop "to"-side filter, put sequences of
related docs into a heap and then, this heap is intersected with
"from"-side filter. As a result, I expect to get an another one join query
which is between join and block join in terms of query time, and also
between them in terms of cost of update.

WDYT about both of algorithms?


On Thu, Feb 13, 2014 at 9:09 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Thu, Feb 13, 2014 at 11:49 AM, Mikhail Khludnev
> <mkhludnev@griddynamics.com> wrote:
> > Mike,
> > Thanks for the clue. It raises a lot of questions:
> >  - Is this cost caused by random access nature of the seekCeil() /** The
> > target term may be before or after the current term. */? Is there any
> chance
> > to make it more efficient by requesting "forward only" TermEnum?
>
> Maybe we could get some gains with a "forward only" mode ... not sure.
>  The enum already shares state, i.e. it checks for the common prefix
> b/w the term it's on now and the term you're seeking to.
>
> But also traversing the FST is not cheap.
>
> >  - will it be faster with 'entirely memory residend term dictionary'?
>
> Likely ...
>
> >  - or the overall idea of using TermEnum just complies with the sub, and
> > it's worth to experiment with writing the previous parent docnum (or
> current
> > block size) in payload and reading it when we need to jump back on
> > advance()?
>
> Maybe?
>
> But I think using FBS is an OK solution too... that's a very fast way
> to find prev/next parent.
>
> >  - once again, would you mind to remind why making DocEnum capable to
> jump
> > back is so hard? Can you recommend any starting point for hacking?
>
> Well, all the impls are heavily "forward only", e.g. we store the doc
> deltas in an int[128] and sum as we go.
>
> But anything we can do to improve block join, or query time join,
> would be great. E.g. for query time join I think this issue would be a
> big win: https://issues.apache.org/jira/browse/LUCENE-4771
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhludnev@griddynamics.com>

Mime
View raw message