lucene-dev mailing list archives

From Ryan Ernst <r...@iernst.net>
Subject Re: Lucene FieldType & specifying numeric type (double, float, )
Date Thu, 24 Mar 2016 16:16:51 GMT
Scalar doesn't mean anything. Point is simple: it is a point in
n-dimensional space, and that is what the data structure provides fast
searching on. Numbers are points in one-dimensional space. Think of a
number line.
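
To make the number-line picture concrete, here is a minimal sketch in plain Java. It is not Lucene's API: the class PointSketch and the methods inBox and inRange are hypothetical names made up for illustration, and a linear coordinate check stands in for whatever the real index structure does. The only thing it tries to show is that a range test on a single number is the one-dimensional case of a box test on an n-dimensional point.

```java
// Hypothetical sketch, not Lucene code: a point is an array of coordinates,
// and a plain number is treated as a point with exactly one coordinate.
public class PointSketch {

    // Returns true if every coordinate of `point` lies inside the
    // axis-aligned box [lower, upper], bounds inclusive.
    static boolean inBox(double[] point, double[] lower, double[] upper) {
        if (point.length != lower.length || point.length != upper.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        for (int d = 0; d < point.length; d++) {
            if (point[d] < lower[d] || point[d] > upper[d]) {
                return false;
            }
        }
        return true;
    }

    // A scalar range check is the one-dimensional case of the box check:
    // the "box" degenerates to an interval on the number line.
    static boolean inRange(double value, double min, double max) {
        return inBox(new double[] {value}, new double[] {min}, new double[] {max});
    }

    public static void main(String[] args) {
        // One dimension: a number on the number line.
        System.out.println(inRange(42.0, 10.0, 100.0));
        // Two dimensions: the same check applied to a pair of coordinates.
        System.out.println(inBox(new double[] {40.7, -74.0},
                                 new double[] {40.0, -75.0},
                                 new double[] {41.0, -73.0}));
    }
}
```

In this sketch the two-dimensional case needs no code beyond what the one-dimensional case already uses, which is the sense in which a number is just a point with one dimension.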
On Mar 24, 2016 8:37 AM, "David Smiley" <david.w.smiley@gmail.com> wrote:

> bq. it wasn't at all clear that the intention was that simple scalars
> would now and forever henceforth be referred to as "points". My impression
> at the time was that the focus of the Jira was on implementation and
> storage level indexing detail rather than the user-facing API level. I see
> now that I was wrong about that. It just seems to me that there should have
> been a more direct public discussion of eliminating the concept of scalar
> values at the API level.
>
> I knew because I was following closely, but otherwise I agree with your
> sentiment.  I don't love the "PointValues" terminology either nor did I
> like "DimensionalValues"; I should have suggested alternatives at the time
> but the Mike & Rob tag-team were working so fast that I didn't interject in
> the narrow window of time before a patch was put up with the current
> names.  More time to publicly discuss would have been better.  FWIW I like
> your suggestion for "Scalar"; that's more meaningful to me.  Naming is hard.
>
> ~ David
>
> On Thu, Mar 24, 2016 at 11:28 AM Jack Krupansky <jack.krupansky@gmail.com>
> wrote:
>
>> I wasn't paying close attention when this whole PointValues saga was
>> unfolding. I get the value of points for spatial data, but conflating the
>> terms "point" and "numeric" is bizarre to say the least. Reading the code,
>> I see "Points represent numeric values", which seems nonsensical to me. A
>> little later the code comment says "Geospatial Point Types - Although basic
>> point types such as DoublePoint support points in multi-dimensional space
>> too, Lucene has specialized classes for location data...", which continues
>> this odd use of terminology. I mean, aren't all points spatial by
>> definition, so that "Geospatial Point" is redundant? It would make more
>> sense to speak of a point as a geospatial number, or that a point is
>> represented by numbers.
>>
>> IOW, NumericValues would make more sense as the base, with (spatial)
>> PointValues derived from the base of numeric values. At least to me that
>> would make more sense.
>>
>> As the PointValues was progressing I had no idea that its intent was to
>> subsume, replace, or deprecate traditional scalar numeric value support in
>> Lucene (or Solr.) It came across primarily as being an improvement for
>> spatial search.
>>
>> Not that I have any objection to greatly improved storage in Lucene, but
>> to now have to speak of all numeric data as points seems quite... weird.
>>
>> Sure, I saw the Jira traffic, like LUCENE-6825 (Add multidimensional
>> byte[] indexing support to Lucene) and LUCENE-6852 (Add DimensionalFormat
>> to Codec), but in all honesty that really did come across as relating to
>> purely spatial data and not being applicable to basic scalar number support.
>>
>> Looking at CHANGES.TXT, I see references like "LUCENE-6852, LUCENE-6975:
>> Add support for points (dimensionally indexed values)", but without any
>> hint that the intent was to subsume or replace non-dimensional numeric
>> indexed values.
>>
>> Now for all I know, non-dimensional (scalar) numeric data can very
>> efficiently be handled as if it had dimension, but that's not exactly
>> obvious and warrants at least some illumination. In traditional terminology
>> a point is 0-dimension (a line is 1-dimension, and a plane is 2-dimension),
>> but traditionally a raw number - a scalar - hasn't been referred to as
>> having dimension, so that is a new concept warranting clear definition.
>>
>> Yeah, I do recall seeing LUCENE-6917 (Deprecate and rename
>> NumericField/RangeQuery to LegacyNumeric) go by in the Jira traffic, and
>> shame on me for not reading the details more carefully, but it wasn't at
>> all clear that the intention was that simple scalars would now and forever
>> henceforth be referred to as "points". My impression at the time was that
>> the focus of the Jira was on implementation and storage level indexing
>> detail rather than the user-facing API level. I see now that I was wrong
>> about that. It just seems to me that there should have been a more direct
>> public discussion of eliminating the concept of scalar values at the API
>> level.
>>
>> (I wonder what physics would be like if they started referring to scalar
>> quantities as vectors.)
>>
>> My apologies for the rant.
>>
>>
>> -- Jack Krupansky
>>
>> On Thu, Mar 24, 2016 at 10:34 AM, David Smiley <david.w.smiley@gmail.com>
>> wrote:
>>
>>> With the move to PointValues and away from trie based indexing of the
>>> terms index, for numerics, everything associated with the trie stuff seems
>>> to be labelled as "Legacy" and marked deprecated.  Even
>>> FieldType.NumericType (now FieldType.LegacyNumericType) -- a simple enum of
>>> INT, LONG, FLOAT, DOUBLE.  I wonder if we ought to reconsider doing this
>>> for FieldType.NumericType, as it articulates the type of numeric data; it
>>> need not be associated with just trie indexing of terms data; it could
>>> articulate how any numeric data is encoded, be it docValues or
>>> pointValues.  This is useful metadata.  It's not strictly required, true,
>>> but it's useful in describing what goes in the field.  This makes a
>>> FieldType instance fairly self-sufficient.  Otherwise, say you have
>>> docValue numerics and/or pointValues, it's ambiguous how the data should be
>>> interpreted.  This doesn't lead to a bug but would help debugging and
>>> allowing APIs to express field requirements simply by providing a FieldType
>>> instance for numeric data.  It used to be self sufficient but now if we
>>> imagine the legacy stuff being removed, it's ambiguous.  In addition, it
>>> would be useful metadata if it found its way into FieldInfo.  Then, say
>>> Luke, could help you know what's there and maybe search it.
>>>
>>> Thoughts?
>>>
>>> ~ David
>>> --
>>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
>>> http://www.solrenterprisesearchserver.com
>>>
>>
>> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>
