lucene-java-user mailing list archives

From Tony Ma <...@opentext.com>
Subject Re: [EXTERNAL] - Re: Is docvalue sorted by value?
Date Wed, 07 Mar 2018 02:27:20 GMT
Thanks Erick!!
Index sorting and early termination are what I am looking for.

On 3/6/18, 11:33 PM, "Erick Erickson" <erickerickson@gmail.com> wrote:

    OK, you're asking a different question I think.
    
    See SOLR-5730 and SOLR-8621, particularly SOLR-5730. This will work
    for only a single field, which you choose at index time. You can still
    sort by any field at the same expense as now, but since your docs are
    ordered by one field, the early termination part won't be applicable to
    other fields.
    
    Best,
    Erick
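
The single-field limitation described above can be illustrated with a small model. This is only a sketch in plain Java, not Lucene code: the Doc class, its price and date fields, and both method names are hypothetical, and list position stands in for the internal doc ID. It shows that once the docs are laid out in price order, walking them in doc ID order is already sorted for price but generally not for any other field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToLongFunction;

// Illustrative model only; this is not Lucene code.
final class IndexSortSketch {
    // Hypothetical document with two numeric fields.
    static final class Doc {
        final long price;
        final long date;
        Doc(long price, long date) { this.price = price; this.date = date; }
    }

    // Imitates choosing an index sort at index time: list position plays
    // the role of the internal doc ID, assigned in ascending price order.
    static List<Doc> sortSegmentByPrice(List<Doc> docs) {
        List<Doc> segment = new ArrayList<>(docs);
        segment.sort(Comparator.comparingLong((Doc d) -> d.price));
        return segment;
    }

    // True if walking the segment in doc ID order already yields the field
    // in ascending order, which is the precondition for stopping early.
    static boolean docOrderMatchesField(List<Doc> segment, ToLongFunction<Doc> field) {
        for (int i = 1; i < segment.size(); i++) {
            if (field.applyAsLong(segment.get(i - 1)) > field.applyAsLong(segment.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Doc> segment = sortSegmentByPrice(Arrays.asList(
                new Doc(30, 3), new Doc(10, 9), new Doc(20, 1)));
        System.out.println(docOrderMatchesField(segment, d -> d.price)); // prints true
        System.out.println(docOrderMatchesField(segment, d -> d.date));  // prints false
    }
}
```

In this model a sort by date still works, but it has to look at every hit, which matches the "same expense as now" remark.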
    
    On Mon, Mar 5, 2018 at 6:28 PM, Tony Ma <tma@opentext.com> wrote:
    > Hi Erick,
    >
    > I raised this question because of the sorting scenario you mentioned in #2.
    >
    > Suppose about 100 docs are hits and my query just wants the top 2. If the
    > values are not sorted, it has to iterate over all 100 docs and find the top 2
    > with a priority queue. If the values are already sorted, it just needs to
    > iterate over the first 2. If the query is unselective, the number of hits
    > might be huge, so pre-sorting or not makes a big difference.
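
The two cases in this paragraph can be sketched as follows. This is a minimal plain-Java illustration, not how Lucene implements its collectors: the class and method names are hypothetical, and the hit values are passed in as a bare array. The first method visits every hit and keeps the k smallest in a bounded heap; the second assumes the hits already arrive in sorted order and stops after k.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative model only; this is not Lucene code.
final class TopKSketch {
    // Hits arrive in arbitrary value order: every hit must be visited, and a
    // bounded max-heap keeps the k smallest values seen so far.
    static List<Long> topKUnsorted(long[] hitValues, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<Long>(Collections.<Long>reverseOrder());
        for (long v : hitValues) {
            heap.offer(v);
            if (heap.size() > k) {
                heap.poll(); // evict the largest of the k+1 candidates
            }
        }
        List<Long> top = new ArrayList<>(heap);
        Collections.sort(top);
        return top;
    }

    // Hits arrive already ordered by the sort field: the first k are the
    // answer, so the loop can stop without touching the remaining hits.
    static List<Long> topKPresorted(long[] sortedHitValues, int k) {
        List<Long> top = new ArrayList<>();
        for (int i = 0; i < sortedHitValues.length && i < k; i++) {
            top.add(sortedHitValues[i]);
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println(topKUnsorted(new long[] {42, 7, 19, 3, 88}, 2));  // prints [3, 7]
        System.out.println(topKPresorted(new long[] {3, 7, 19, 42, 88}, 2)); // prints [3, 7]
    }
}
```

With n hits, the first method does work proportional to n log k, while the second does work proportional to k, which is why the gap grows for unselective queries.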
    >
    > I understand your point that if the doc values are not persisted in doc id
    > order, a field value cannot be retrieved by doc id.
    >
    > Actually, I am just wondering how Lucene handles the sorting scenario: is
    > iterating over all values of all docs unavoidable?
    >
    >
    > On 3/6/18, 6:50 AM, "Erick Erickson" <erickerickson@gmail.com> wrote:
    >
    >     I think there are two issues here that are being conflated:
    >     1> _within_ a document, i.e. for a multi-valued field, the values are
    >     stored, as Dominik says, as a SORTED_SET. Not only will they be returned
    >     (if you return from docValues rather than stored) in lexical order,
    >     but identical values will be collapsed.
    >
    >     2> across multiple documents, the question about "...persisted with
    >     order of values, not document id..." really makes no sense. The point
    >     of DocValues is to answer the question "for document X what is the
    >     value of field Y". X here is the _internal_ document ID. Now consider
    >     a search. There are two documents that are hits, doc 35 and doc 198
    >     (internal lucene doc ID). To sort them by field Y you have to know
    >     what the value in that field is for those two docs. How would
    >     "pre-ordering" the values help here? If I have the _values_ in order,
    >     I have no clue what docs are associated with them. That question is
    >     what the "inverted index" is there to answer.
    >
    >     So I have doc 35 and 198. Think of DocValues as a large array indexed
    >     by internal doc id. To know how these two docs sort all I have to do
    >     is index into the array. It's slightly more complicated than that, but
    >     conceptually that's what happens.
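
The "large array indexed by internal doc id" picture can be sketched in plain Java. This is a conceptual model only, not Lucene's actual storage format: the class name, the method name, and the sample values for docs 35 and 198 are made up for illustration. Sorting the hits needs nothing more than one array lookup per doc ID on each comparison.

```java
import java.util.Arrays;
import java.util.Comparator;

// Conceptual model only; this is not Lucene's on-disk doc values format.
final class DocValuesArraySketch {
    // Sorts hit doc IDs by field value. valuesByDocId[docId] answers
    // "for document X, what is the value of field Y".
    static int[] sortHitsByValue(int[] hitDocIds, final long[] valuesByDocId) {
        Integer[] boxed = new Integer[hitDocIds.length];
        for (int i = 0; i < hitDocIds.length; i++) {
            boxed[i] = hitDocIds[i];
        }
        Arrays.sort(boxed, Comparator.comparingLong((Integer docId) -> valuesByDocId[docId]));
        int[] sorted = new int[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            sorted[i] = boxed[i];
        }
        return sorted;
    }

    public static void main(String[] args) {
        long[] valuesByDocId = new long[200]; // one slot per internal doc ID
        valuesByDocId[35] = 500;              // made-up value of field Y for doc 35
        valuesByDocId[198] = 120;             // made-up value of field Y for doc 198
        int[] hits = {35, 198};
        System.out.println(Arrays.toString(sortHitsByValue(hits, valuesByDocId))); // prints [198, 35]
    }
}
```

If the array were ordered by value instead, the lookup valuesByDocId[35] would no longer have a meaning, which is the point made in 2> above.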
    >
    >     Best,
    >     Erick
    >
    >     On Mon, Mar 5, 2018 at 11:29 AM, Dominik Safaric
    >     <dominiksafaric@gmail.com> wrote:
    >     >> So, can doc values be persisted in order of value rather than document id?
    >     >> That should make sorting fast, since the values would be pre-ordered
    >     >> instead of scanned and sorted at runtime.
    >     >
    >     >
    >     > No, unfortunately doc values cannot be persisted in order. Lucene stores
    >     > these values internally as a DocValuesType.SORTED_SET, where the values
    >     > are ordered using, for example, Long.compareTo().
    >     >
    >     > If you'd like to retrieve the values in insertion order, use stored
    >     > fields instead of doc values. You can then access the values in order
    >     > using the LeafReader's document function. However, beware that this may
    >     > cause performance issues because it requires loading the document from disk.
    >     >
    >     > If you need to store and retrieve multiple numeric values per document
    >     > in order, you might consider using PointValues. PointValues are indexed
    >     > internally with KD-trees. But beware that PointValues have limited
    >     > dimensionality: you can, for example, store values in 8 dimensions, each
    >     > of at most 16 bytes.
    >     >
    >     >> On 5 Mar 2018, at 15:33, Tony Ma <tma@opentext.com> wrote:
    >     >>
    >     >> Per my understanding, doc values (binary doc values / numeric doc values)
    >     >> are stored in document id order. Sorted numeric doc values just means that
    >     >> if a document has multiple values, they are sorted within that document,
    >     >> but across documents the values are still ordered by document id. Is that true?
    >     >> So, can doc values be persisted in order of value rather than document id?
    >     >> That should make sorting fast, since the values would be pre-ordered
    >     >> instead of scanned and sorted at runtime.
    >     >
    >     >
    >     > ---------------------------------------------------------------------
    >     > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    >     > For additional commands, e-mail: java-user-help@lucene.apache.org
    >     >