lucene-java-user mailing list archives

From Olivier Binda <olivier.bi...@wanadoo.fr>
Subject Re: Compressing docValues with variable length bytes[] by block of 16k ?
Date Sun, 09 Aug 2015 16:09:27 GMT
On 08/09/2015 04:55 PM, Arjen van der Meijden wrote:
>
> On 9-8-2015 16:22, Toke Eskildsen wrote:
>> Robert Muir <rcmuir@gmail.com> wrote:
>>> I am tired of repeating this:
>>> Don't use BINARY docvalues
>>> Don't use BINARY docvalues
>>> Don't use BINARY docvalues
>>> Use types like SORTED/SORTED_SET which will compress the term
>>> dictionary and make use of ordinals in your application instead.
>> This seems contrary to
>> http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html
>>
>> Maybe you could update the JavaDoc for that field to warn against using it?
> It (probably) depends on the contents of the values. If the number of
> distinct values is roughly equal to the number of documents the javadoc
> suggest the binary docvalues are a valid choice.
My values are unique, so the number of distinct values equals the number of documents.
They vary in size: at least 10 bytes each, and some may be much bigger
(say 4 KB).

I don't share, index or sort them.
I don't do grouping/faceting either


I only want to store, retrieve and traverse those values
>
> That's this part:
> "The values are stored directly with no sharing, which is a good fit
> when the fields don't share (many) values, such as a title field."
>
> If there are (much) less distinct values than documents, Robert's reply
> and the documentation suggest the same:
> " If values may be shared and sorted it's better to use
> SortedDocValuesField."
>
> So as soon as compression of smallish values starts making sense due to
> repetition amongst documents, it may be time to move away from the
> BinaryDocValuesField towards another variant.
>
> If only parts of the values are repeated (for instance something like
> e-mail addresses where many will end with 'gmail.com' and 'outlook.com')
> it becomes more complicated.

At the moment, there are some repeated parts inside each value, but many more
repeated parts across docIds, like "Expression" and "Reading".

Also, I'm stuck with Lucene 4.7.0 (or 4.7.2) because starting with
version 4.8, Lucene uses try-with-resources, which isn't supported on
Android before Android 4.4.


    SortedDocValuesField stores a per-document BytesRef
    <http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/util/BytesRef.html>
    value, indexed for sorting.

    If you also need to store the value, you should add a separate StoredField
    <http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/StoredField.html>
    instance.


I actually went with BinaryDocValues because I thought that
DocValues were far more efficient than the pre-4.0 fields for storing data
(e.g. needing only one seek/read, with mmap), especially for traversal.

In my app, I traverse all BinaryDocValues in increasing docId order,
deserialize each value (lightning fast with FlatBuffers: complex objects
with no object creation) and compute some stats.

Would I be able to do that as efficiently with a StoredField?


Apparently, only StoredFields are compressed:


    CompressingStoredFieldsFormat



Maybe I should use that (and ditch the useless docValue, or make it
store a BytesRef) to get compression?
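
To make the idea in the subject line concrete, here is a minimal sketch of compressing variable-length byte[] values in blocks of roughly 16 KB. It uses only java.util.zip from the JDK and is not Lucene code: the class name BlockCompressedValues and its methods build, get and compressedSize are hypothetical, and this is not how CompressingStoredFieldsFormat is implemented. It only illustrates why strings repeated across documents compress well once neighbouring values share a block, and why traversal in increasing docId order stays cheap (each block is inflated once).

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch: per-document values grouped into ~16 KB blocks, each deflated as a unit. */
final class BlockCompressedValues {
    private final List<byte[]> compressedBlocks = new ArrayList<byte[]>();
    private final List<Integer> rawBlockLengths = new ArrayList<Integer>();
    private final int[] blockOfDoc;     // docId -> index of the block holding its value
    private final int[] offsetInBlock;  // docId -> start offset inside the uncompressed block
    private final int[] lengthOfDoc;    // docId -> value length in bytes
    private int cachedBlock = -1;       // last inflated block, reused during sequential traversal
    private byte[] cachedRaw;

    private BlockCompressedValues(int docCount) {
        blockOfDoc = new int[docCount];
        offsetInBlock = new int[docCount];
        lengthOfDoc = new int[docCount];
    }

    /** Packs values in docId order; a block is closed before it would exceed blockSize. */
    static BlockCompressedValues build(byte[][] values, int blockSize) {
        BlockCompressedValues store = new BlockCompressedValues(values.length);
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        for (int doc = 0; doc < values.length; doc++) {
            // A value larger than blockSize simply ends up alone in an oversized block.
            if (raw.size() > 0 && raw.size() + values[doc].length > blockSize) {
                store.flush(raw);
            }
            store.blockOfDoc[doc] = store.compressedBlocks.size();
            store.offsetInBlock[doc] = raw.size();
            store.lengthOfDoc[doc] = values[doc].length;
            raw.write(values[doc], 0, values[doc].length);
        }
        if (raw.size() > 0) {
            store.flush(raw);
        }
        return store;
    }

    private void flush(ByteArrayOutputStream raw) {
        byte[] input = raw.toByteArray();
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        compressedBlocks.add(out.toByteArray());
        rawBlockLengths.add(input.length);
        raw.reset();
    }

    /** Returns a copy of the value for doc, inflating its block only if it is not already cached. */
    byte[] get(int doc) {
        if (lengthOfDoc[doc] == 0) {
            return new byte[0];
        }
        int block = blockOfDoc[doc];
        if (block != cachedBlock) {
            cachedRaw = inflate(block);
            cachedBlock = block;
        }
        int start = offsetInBlock[doc];
        return Arrays.copyOfRange(cachedRaw, start, start + lengthOfDoc[doc]);
    }

    private byte[] inflate(int block) {
        byte[] raw = new byte[rawBlockLengths.get(block)];
        Inflater inflater = new Inflater();
        inflater.setInput(compressedBlocks.get(block));
        try {
            int pos = 0;
            while (pos < raw.length) {
                int n = inflater.inflate(raw, pos, raw.length - pos);
                if (n == 0 && (inflater.finished() || inflater.needsInput())) {
                    break; // defensive: no more data available
                }
                pos += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block " + block, e);
        } finally {
            inflater.end();
        }
        return raw;
    }

    /** Total size of all compressed blocks, excluding the per-document index arrays. */
    long compressedSize() {
        long total = 0;
        for (byte[] b : compressedBlocks) {
            total += b.length;
        }
        return total;
    }
}
```

Calling BlockCompressedValues.build(values, 16 * 1024) and then get(0), get(1), ... in order inflates each block once. The trade-off is that random access to a single docId costs a whole-block inflate, and the three int arrays add 12 bytes per document on top of the compressed data.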

Many thanks for all the insights, :)
Olivier

> Best regards,
>
> Arjen
>

