lucene-dev mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Lucene FieldType & specifying numeric type (double, float, )
Date Sat, 26 Mar 2016 00:35:03 GMT
Thanks, Mike. I see the prompt commits!

1. I wasn't suggesting to revert to the old implementation of IntField,
just reusing the name - simply renaming IntPoint to IntField.

2. Since my previous message I see that you (and others) have also been
using the term "dimensionalValues" (not points) in some Jiras related to
this work, so the terminology does need to get cleaned up:
https://issues.apache.org/jira/browse/LUCENE-6917
https://issues.apache.org/jira/browse/SOLR-8396

3. I've also added some comments on a related Solr Jira that intersect with
the Lucene points stuff that you might want to chime in on:
https://issues.apache.org/jira/browse/SOLR-8396

4. The main question on all of this - my points - is whether any of the
senior committers (especially Solr) wish to elevate the importance of any
of these points from my modest level of rambling.

With that, I think I'm done on this topic... for now.

-- Jack Krupansky

On Fri, Mar 25, 2016 at 8:12 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Fri, Mar 25, 2016 at 6:23 PM, Jack Krupansky
> <jack.krupansky@gmail.com> wrote:
>
> > Mike, thanks for that blog post link.
>
> You're welcome!
>
> > (Please let me know if this discussion should be moved elsewhere, either
> > to Jira or a fresh thread, although it seems germane to David's original
> > inquiry, at least a little.)
>
> Here seems good.
>
> > 1. You need to update the post a little, like the change for
> > ExactPointQuery that occurred on 2/20, a few days after your post:
> > https://issues.apache.org/jira/browse/LUCENE-7039
> >
> > In particular, now we have IntPoint.newExactQuery(field, value),
> > IntPoint.newRangeQuery(field, lowerValue, upperValue), etc.
>
> Thanks, but I likely won't update it ... stuff changes over time ;)
> If I spent time updating my old posts I would never get anything else
> done!
>
> And the post does state that points are unreleased / subject to change,
> iirc.
>
> > 2. Note that as in the actual API, those are "values", not "points." In
> > fact, the Javadoc says "Create a query for matching an exact integer
> > value" and "Create a range query for integer values."
> >
> > https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/document/IntPoint.java
>
> Yeah, I guess we sometimes say value for 1D points?
>
> > 3. The class is declared as "class IntPoint extends Field", which feels a
> > bit odd without adding any useful info. I mean, why isn't IntPoint
> > extending Point? And these really are fields, not points. I'd suggest
> > sticking with "IntField". I mean, the Javadoc does say "An indexed
> > {@code int} field." Ditto for the other numeric XxxPoint classes.
>
> Well, Field is the base class Lucene uses for "that which you add to a
> Document for indexing", so we pretty much have to subclass it.
>
> I don't think we should just "re-use" the previous IntField: that
> would be very trappy, the implementation is very different, you can
> index 2D points, etc.
>
> > 4. I didn't notice a DatePoint class in the Lucene search package. I'm
> > sure it's floating around somewhere, but it does seem odd that it's not...
> > right there with Int, Float, Double, et al.
>
> You're right, patches welcome!
>
> > 5. It would help people to speak of a numeric field as a "space" which
> > happens to be a 1-dimensional line (redundant there!), so that the value
> > in a numeric field is then effectively a "point" in that 1D space. That's
> > if we're going to stick with this conception of simple, scalar, numeric
> > fields as being "points", but I think it makes more sense to speak of
> > numeric fields with dimensionality, like 2D/3D dimensional
> > int/float/double field. The n numeric values do happen to correspond to a
> > "point" when n>1, but at the API level they seem to be dimensional values.
> > I mean, even for 2D and 3D, the Javadoc for
> > Int/Float/DoublePoint.newRangeQuery says "Create a range query for
> > n-dimensional integer values."
>
> Not sure what you're saying here.
>
> > 6. Your post refers to "a new feature called dimensional points", but
> > that term doesn't seem to be used commonly in the code or Jira (just a
> > couple of references, but not in titles.) Besides, it seems redundant - I
> > mean, when does a point not have dimensionality? I would suggest renaming
> > that to "dimensional values" or dimValues, rather than "points." Or, maybe
> > just abstractly as "dimensional fields" to indicate that numeric fields
> > support multiple dimensions now. To me, it feels like there should be a
> > DimensionalField derived from Field that is used as base for IntField, et
> > al, to reinforce the dimensionality and provide a common base in the
> > Javadoc, or other places in the code that wish to refer to fields that
> > are either dimensional or numeric. Or, maybe it should just be
> > NumericField?
>
> I think "points" (sounds like N dims) is more correct for the general
> feature name and its related classes, than "value" (sounds like 1D).
>
> > 7. I see a minor bug in an exception:
> >
> >     if (lowerPoint.length != upperPoint.length) {
> >       throw new IllegalArgumentException("lowerPoint has length=" + numDims
> >           + " but upperPoint has different length=" + upperPoint.length);
> >     }
> >
> > numDims should be lowerPoint.length. For a simple Int"Point" (Field!)
> > then length would be 4 but numDims would be 1.
>
> Thanks!  I'll go fix that.
>
> > 8. I was a little disappointed that a point query wasn't a lot faster
> > than trie field. I mean, 25% is decent, but I would have imagined that all
> > of this work would have resulted in more like a 400% gain in speed. Is the
> > current implementation in master considered optimal or does it have a lot
> > of room for improvement? Also, is this for an index primarily cached in OS
> > system memory or primarily accessed with I/O? And, I'm curious whether
> > exact point and narrow range queries (e.g., trying to select less than
> > 0.25% of indexed documents) are indeed only 25% faster than trie.
>
> I would love a 1000% speedup, but it was what it was on that
> day that I tested ;)  I'll take 25% and faster indexing, much less
> heap, etc.
>
> It's most analogous to postings: primarily IO (sequential, in the 1D
> case), so, yeah you want those pages to be hot in the OS's IO cache.
>
> There have already been lots of changes since then, maybe the number
> is different now.
>
> Maybe a different benchmark gives different results.  Benchmarks welcome!
>
> And, yes, I'm sure there are improvements still to make.  Various devs
> have been doing so intensely for the past few weeks.  Patches welcome!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
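
On item 7 in the quoted message: the fix amounts to reporting lowerPoint.length instead of numDims in the exception text. Below is a minimal, standalone Java sketch of the corrected check. It assumes the bounds are packed byte arrays (so a single int dimension occupies 4 bytes, as the message implies). The class and method names (PointLengthCheck, checkLengths) are hypothetical and exist only for illustration; this is not Lucene's actual code.

```java
public class PointLengthCheck {

    // Hypothetical helper, not a Lucene method: validates that both packed
    // bounds have the same byte length and reports the actual lengths.
    static void checkLengths(byte[] lowerPoint, byte[] upperPoint) {
        if (lowerPoint.length != upperPoint.length) {
            // Report lowerPoint.length (bytes), not the dimension count.
            throw new IllegalArgumentException(
                "lowerPoint has length=" + lowerPoint.length
                + " but upperPoint has different length=" + upperPoint.length);
        }
    }

    public static void main(String[] args) {
        byte[] lower = new byte[Integer.BYTES];     // assumed: one int dimension, 4 bytes
        byte[] upper = new byte[2 * Integer.BYTES]; // mismatched on purpose, 8 bytes
        try {
            checkLengths(lower, upper);
        } catch (IllegalArgumentException e) {
            // prints: lowerPoint has length=4 but upperPoint has different length=8
            System.out.println(e.getMessage());
        }
    }
}
```

With the original message, the same mismatch would have reported length=1 (the dimension count) for a 4-byte lower bound, which is the confusion item 7 describes.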
