lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Lucene FieldType & specifying numeric type (double, float, )
Date Sat, 26 Mar 2016 00:12:20 GMT
On Fri, Mar 25, 2016 at 6:23 PM, Jack Krupansky
<jack.krupansky@gmail.com> wrote:

> Mike, thanks for that blog post link.

You're welcome!

> (Please let me know if this discussion should be moved elsewhere, either to
> Jira or a fresh thread, although it seems germane to David's original
> inquiry, at least a little.)

Here seems good.

> 1. You need to update the post a little, like the change for ExactPointQuery
> that occurred on 2/20, a few days after your post:
> https://issues.apache.org/jira/browse/LUCENE-7039
>
> In particular, now we have IntPoint.newExactQuery(field, value),
> IntPoint.newRangeQuery(field, lowerValue, upperValue), etc.

Thanks, but I likely won't update it ... stuff changes over time ;)
If I spent time updating my old posts I would never get anything else
done!

And the post does state that points are unreleased / subject to change, iirc.

> 2. Note that as in the actual API, those are "values", not "points." In
> fact, the Javadoc says "Create a query for matching an exact integer value"
> and "Create a range query for integer values."
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/document/IntPoint.java

Yeah, I guess we sometimes say value for 1D points?

> 3. The class is declared as "class IntPoint extends Field", which feels a bit
> odd without adding any useful info. I mean, why isn't IntPoint extending
> Point? And these really are fields, not points. I'd suggest sticking with
> "IntField". I mean, the Javadoc does say "An indexed {@code int} field."
> Ditto for the other numeric XxxPoint classes.

Well, Field is the base class Lucene uses for "that which you add to a
Document for indexing", so we pretty much have to subclass it.

I don't think we should just "re-use" the previous IntField: that
would be very trappy, since the implementation is very different, you
can index 2D points, etc.

> 4. I didn't notice a DatePoint class in the Lucene search package. I'm sure
> it's floating around somewhere, but it does seem odd that it's not... right
> there with Int, Float, Double, et al.

You're right, patches welcome!

> 5. It would help people to speak of a numeric field as a "space" which
> happens to be a 1-dimensional line (redundant there!), so that the value in
> a numeric field is then effectively a "point" in that 1D space. That's if
> we're going to stick with this conception of simple, scalar, numeric fields
> as being "points", but I think it makes more sense to speak of numeric
> fields with dimensionality, like 2D/3D int/float/double fields.
> The n numeric values do happen to correspond to a "point" when n>1, but at
> the API level they seem to be dimensional values. I mean, even for 2D and
> 3D, the Javadoc for Int/Float/DoublePoint.newRangeQuery says "Create a range
> query for n-dimensional integer values."

Not sure what you're saying here.

> 6. Your post refers to "a new feature called dimensional points", but that
> term doesn't seem to be used commonly in the code or Jira (just a couple of
> references, but not in titles.) Besides, it seems redundant - I mean, when
> does a point not have dimensionality? I would suggest renaming that to
> "dimensional values" or dimValues, rather than "points." Or, maybe just
> abstractly as "dimensional fields" to indicate that numeric fields support
> multiple dimensions now. To me, it feels like there should be a
> DimensionalField derived from Field that is used as base for IntField, et
> al, to reinforce the dimensionality and provide a common base in the
> Javadoc, or other places in the code that wish to reference to fields that
> are either dimensional or numeric. Or, maybe it should just be NumericField?

I think "points" (sounds like N dims) is more correct for the general
feature name and its related classes, than "value" (sounds like 1D).
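
To make the points-versus-values distinction concrete, here is a minimal
sketch in plain Java with no Lucene dependency. BoxRangeSketch and its
methods are hypothetical names invented for this illustration, and the
sketch only models the matching rule, not how Lucene indexes or searches
points. It assumes inclusive bounds on int values.

```java
/** Hypothetical helper, not part of Lucene: models N-dimensional range matching. */
final class BoxRangeSketch {
    private final int[] lower; // inclusive lower bound, one entry per dimension
    private final int[] upper; // inclusive upper bound, one entry per dimension

    BoxRangeSketch(int[] lower, int[] upper) {
        if (lower.length != upper.length) {
            throw new IllegalArgumentException("bounds must have the same number of dimensions");
        }
        this.lower = lower.clone();
        this.upper = upper.clone();
    }

    /** An exact-match query is the degenerate range where lower == upper. */
    static BoxRangeSketch exact(int... point) {
        return new BoxRangeSketch(point, point);
    }

    /** A point matches only if every dimension falls inside its own range. */
    boolean matches(int... point) {
        if (point.length != lower.length) {
            throw new IllegalArgumentException("point has " + point.length
                + " dimensions but the range has " + lower.length);
        }
        for (int d = 0; d < point.length; d++) {
            if (point[d] < lower[d] || point[d] > upper[d]) {
                return false;
            }
        }
        return true;
    }
}
```

With one dimension this behaves like an ordinary numeric "value" range
(new BoxRangeSketch(new int[] {10}, new int[] {20}).matches(15) is true);
with two or more, the same rule describes a box, which is why "point"
reads as the more general name.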

> 7. I see a minor bug in an exception:
>
>     if (lowerPoint.length != upperPoint.length) {
>       throw new IllegalArgumentException("lowerPoint has length=" + numDims
>           + " but upperPoint has different length=" + upperPoint.length);
>     }
>
> numDims should be lowerPoint.length. For a simple Int"Point" (Field!) then
> length would be 4 but numDims would be 1.

Thanks!  I'll go fix that.
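
For reference, a minimal sketch of what the corrected check could look
like, following the suggestion in item 7 to report lowerPoint.length. This
is not the actual Lucene source: RangeArgsSketch and checkSameLength are
hypothetical names, and the int[] parameter type is an assumption made
only to keep the example self-contained.

```java
/** Hypothetical stand-in for the argument check quoted above; not Lucene's code. */
final class RangeArgsSketch {
    /** Throws if the two bounds do not have the same number of entries. */
    static void checkSameLength(int[] lowerPoint, int[] upperPoint) {
        if (lowerPoint.length != upperPoint.length) {
            // Report lowerPoint.length rather than a separate numDims value,
            // so the message shows exactly the two lengths that were compared.
            throw new IllegalArgumentException("lowerPoint has length=" + lowerPoint.length
                + " but upperPoint has different length=" + upperPoint.length);
        }
    }
}
```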

> 8. I was a little disappointed that a point query wasn't a lot faster than
> trie field. I mean, 25% is decent, but I would have imagined that all of
> this work would have resulted in more like a 400% gain in speed. Is the
> current implementation in master considered optimal or does it have a lot of
> room for improvement? Also, is this for an index primarily cached in OS
> memory or primarily accessed with I/O? And, I'm curious whether exact
> point and narrow range queries (e.g., trying to select less than 0.25% of
> indexed documents) are indeed only 25% faster than trie.

I would love a 1000% speedup, but it was what it was on that
day that I tested ;)  I'll take 25% and faster indexing, much less
heap, etc.

It's most analogous to postings: primarily IO (sequential, in the 1D
case), so, yeah you want those pages to be hot in the OS's IO cache.

There have already been lots of changes since then, maybe the number
is different now.

Maybe a different benchmark gives different results.  Benchmarks welcome!

And, yes, I'm sure there are improvements still to make.  Various devs
have been doing so intensely for the past few weeks.  Patches welcome!

Mike McCandless

http://blog.mikemccandless.com


