lucene-dev mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Lucene FieldType & specifying numeric type (double, float, )
Date Fri, 25 Mar 2016 22:23:09 GMT
Mike, thanks for that blog post link. I just read it, and looked at some
code. Thanks to your post I can at least pretend to feel that I know a
little bit about what has been going on! I even know now what BKD refers to
(Block K-D tree), and that it simultaneously is a replacement for Trie
fields and multi-dimensional.

(Please let me know if this discussion should be moved elsewhere, either to
Jira or a fresh thread, although it seems germane to David's original
inquiry, at least a little.)

1. You need to update the post a little, like the change for
ExactPointQuery that occurred on 2/20, a few days after your post:
https://issues.apache.org/jira/browse/LUCENE-7039

In particular, now we have IntPoint.newExactQuery(field, value),
IntPoint.newRangeQuery(field, lowerValue, upperValue), etc.

2. Note that as in the actual API, those are "values", not "points." In
fact, the Javadoc says "Create a query for matching an exact integer value"
and "Create a range query for integer values."
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/document/IntPoint.java

3. The class is declared as "class IntPoint extends Field", which feels a bit
odd without adding any useful info. I mean, why isn't IntPoint extending
Point? And these really are fields, not points. I'd suggest sticking with
"IntField". I mean, the Javadoc does say "An indexed {@code int} field."
Ditto for the other numeric XxxPoint classes.

4. I didn't notice a DatePoint class in the Lucene search package. I'm sure
it's floating around somewhere, but it does seem odd that it's not... right
there with Int, Float, Double, et al.

5. It would help people to speak of a numeric field as a "space" which
happens to be a 1-dimensional line (redundant there!), so that the value in
a numeric field is then effectively a "point" in that 1D space. That's if
we're going to stick with this conception of simple, scalar, numeric fields
as being "points", but I think it makes more sense to speak of numeric
fields with dimensionality, like a 2D or 3D int/float/double field.
The n numeric values do happen to correspond to a "point" when n>1, but at
the API level they seem to be dimensional values. I mean, even for 2D and
3D, the Javadoc for Int/Float/DoublePoint.newRangeQuery says "Create a
range query for n-dimensional integer values."
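
To make the "dimensional values" reading concrete, here is a minimal sketch
in plain Java, not Lucene code. The class and method names are hypothetical,
and the inclusive bounds are an assumption of the sketch. It only shows that
a scalar int field is the n=1 case of the same per-dimension range check:

```java
public class DimensionalRangeSketch {

    /**
     * Hypothetical helper (not a Lucene API): returns true if every
     * dimension of value lies within [lower, upper], bounds inclusive.
     */
    public static boolean inRange(int[] value, int[] lower, int[] upper) {
        if (value.length != lower.length || value.length != upper.length) {
            throw new IllegalArgumentException("dimension mismatch: value=" + value.length
                + " lower=" + lower.length + " upper=" + upper.length);
        }
        for (int dim = 0; dim < value.length; dim++) {
            if (value[dim] < lower[dim] || value[dim] > upper[dim]) {
                return false; // outside the range in this dimension
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 1D: an ordinary scalar int value is a "point" on a line.
        System.out.println(inRange(new int[] {42}, new int[] {10}, new int[] {100}));
        // 2D: the same check applied per dimension describes a box.
        System.out.println(inRange(new int[] {5, 50}, new int[] {0, 0}, new int[] {10, 10}));
    }
}
```

The first call passes and the second fails only on its second dimension,
which is why "values with dimensionality" reads more naturally to me than
"points" at the API level.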

6. Your post refers to "a new feature called *dimensional points
<https://issues.apache.org/jira/browse/LUCENE-6852>*", but that term
doesn't seem to be used commonly in the code or Jira (just a couple of
references, but not in titles.) Besides, it seems redundant - I mean, when
does a point not have dimensionality? I would suggest renaming that to
"dimensional values" or dimValues, rather than "points." Or, maybe just
abstractly as "dimensional fields" to indicate that numeric fields support
multiple dimensions now. To me, it feels like there should be a
DimensionalField derived from Field that is used as base for IntField, et
al, to reinforce the dimensionality and provide a common base in the
Javadoc, or other places in the code that wish to refer to fields that
are either dimensional or numeric. Or, maybe it should just be NumericField?

7. I see a minor bug in an exception:

    if (lowerPoint.length != upperPoint.length) {
      throw new IllegalArgumentException("lowerPoint has length=" + numDims
          + " but upperPoint has different length=" + upperPoint.length);
    }

numDims should be lowerPoint.length. For a simple Int"Point" (Field!), the
length would be 4 (bytes) but numDims would be 1.

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java
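
For illustration, a standalone sketch of the corrected message in plain Java.
The class name and the pack() helper are hypothetical stand-ins, not Lucene
code, and pack() does not reproduce Lucene's actual byte encoding; only the
array lengths matter here:

```java
import java.nio.ByteBuffer;

public class LengthCheckSketch {

    /** Hypothetical stand-in for packing int values: 4 bytes per dimension. */
    public static byte[] pack(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    /** Same check as above, but the message reports lowerPoint.length. */
    public static void checkLengths(byte[] lowerPoint, byte[] upperPoint) {
        if (lowerPoint.length != upperPoint.length) {
            throw new IllegalArgumentException("lowerPoint has length=" + lowerPoint.length
                + " but upperPoint has different length=" + upperPoint.length);
        }
    }

    public static void main(String[] args) {
        byte[] lower = pack(1);    // 1 dimension  -> 4 bytes
        byte[] upper = pack(1, 2); // 2 dimensions -> 8 bytes
        try {
            checkLengths(lower, upper);
        } catch (IllegalArgumentException e) {
            // prints: lowerPoint has length=4 but upperPoint has different length=8
            System.out.println(e.getMessage());
        }
    }
}
```

With numDims in the message, the same mismatch would have reported length=1
for the lower point, which is the confusing part.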

8. I was a little disappointed that a point query wasn't a lot faster than
a trie field. I mean, 25% is decent, but I would have imagined that all of
this work would have resulted in more like a 400% gain in speed. Is the
current implementation on master considered optimal or does it have a lot of
room for improvement? Also, is this for an index primarily cached in OS
memory or primarily accessed with I/O? And, I'm curious whether
exact point and narrow range queries (e.g., trying to select less than
0.25% of indexed documents) are indeed only 25% faster than trie.

My apologies for my limited depth of comprehension on all of this new work.


-- Jack Krupansky

On Thu, Mar 24, 2016 at 12:51 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> See also my recent blog post describing this new feature:
> https://www.elastic.co/blog/lucene-points-6.0
>
> Net/net, in the 1D case, points look like a win across the board vs.
> the legacy (postings) implementation.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Mar 24, 2016 at 12:33 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > On Thu, Mar 24, 2016 at 12:16 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
> >> I'm pretty confused about points as well and until very recently thought
> >> these were geospatial improvements only.
> >>
> >> It would be good to understand the mechanics of points versus numerics. I'm
> >> particularly interested in not losing the high performance numeric DocValues
> >> support, which has become so important for analytics.
> >>
> >>
> >
> > Unrelated. Points are the structure used to find matching documents
> > from e.g. a query point, range, radius, shape, whatever. They use a
> > tree-like structure for this. So they are the replacement for
> > NumericRangeQuery, which "simulates" a tree with an inverted index.
> >
> > Instead of inverted index+postings list, we just have a proper tree
> > structure for these things: fixed-width, multidimensional values. It
> > has a different indexreader api for example, that lets you control how
> > the tree is traversed as it goes (by returning INSIDE [collect all the
> > docids in here blindly, this entire tree range is relevant], OUTSIDE
> > [not relevant to my query, don't traverse this region anymore], or
> > CROSSES [i may or may not be interested, have to traverse further to
> > nodes (sub-ranges or values themselves)]).
> >
> > They also have the advantage of not being limited to 64 bits or 1
> > dimension, you can have up to 128 bits and up to 8 dimensions. So each
> > thing you are adding to your document is really a "point in
> > n-dimensional space", so if you want to have 3 lat+long pairs as a
> > double[] in a single field, that works as you expect.
> >
> > See more information here:
> >
> > https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/PointValues.java#L35-L79
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
