lucene-java-user mailing list archives

From Arjen van der Meijden <acmmail...@tweakers.net>
Subject Re: Compressing docValues with variable length bytes[] by block of 16k ?
Date Sun, 09 Aug 2015 14:55:54 GMT


On 9-8-2015 16:22, Toke Eskildsen wrote:
> Robert Muir <rcmuir@gmail.com> wrote:
>> I am tired of repeating this:
>> Don't use BINARY docvalues
>> Don't use BINARY docvalues
>> Don't use BINARY docvalues
>> Use types like SORTED/SORTED_SET which will compress the term
>> dictionary and make use of ordinals in your application instead.
> This seems contrary to
> http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html
>
> Maybe you could update the JavaDoc for that field to warn against using it?
It (probably) depends on the contents of the values. If the number of
distinct values is roughly equal to the number of documents, the javadoc
suggests that binary docvalues are a valid choice.

That's this part:
"The values are stored directly with no sharing, which is a good fit
when the fields don't share (many) values, such as a title field."

If there are (far) fewer distinct values than documents, Robert's reply
and the documentation suggest the same:
"If values may be shared and sorted it's better to use
SortedDocValuesField."

So as soon as compression of smallish values starts making sense due to
repetition amongst documents, it may be time to move away from the
BinaryDocValuesField towards another variant.
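
As a rough sketch of that rule of thumb: the plain-Java helper below
measures how many values in a sample are distinct and names the field type
the javadoc wording would point to. The class name DocValuesChoice, its
methods and the 0.5 cut-off are all hypothetical and chosen for
illustration; neither Lucene nor the javadoc defines such a threshold, and
the code does not call Lucene itself.

```java
import java.util.HashSet;
import java.util.List;

class DocValuesChoice {
    // Assumed cut-off for illustration only; not a Lucene constant.
    static final double SHARED_THRESHOLD = 0.5;

    // Fraction of values in the sample that are distinct (1.0 = nothing shared).
    static double distinctRatio(List<String> values) {
        if (values.isEmpty()) {
            return 1.0;
        }
        return (double) new HashSet<>(values).size() / values.size();
    }

    // Names the field type that fits the sample under the assumed cut-off.
    static String suggestFieldType(List<String> values) {
        return distinctRatio(values) < SHARED_THRESHOLD
                ? "SortedDocValuesField"
                : "BinaryDocValuesField";
    }

    public static void main(String[] args) {
        // Title-like values: every document has its own value.
        List<String> titles = List.of(
                "Intro to Lucene", "Doc values explained", "Sorting at scale");
        // Country-like values: a handful of values shared by many documents.
        List<String> countries = List.of(
                "NL", "DK", "NL", "NL", "NL", "DK", "NL", "DK");

        System.out.println(suggestFieldType(titles));    // BinaryDocValuesField
        System.out.println(suggestFieldType(countries)); // SortedDocValuesField
    }
}
```

In practice you would run something like this over a representative sample
of the real field values before settling on a field type.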

If only parts of the values are repeated (for instance e-mail addresses,
many of which end with 'gmail.com' or 'outlook.com'), it becomes more
complicated.
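
To make the partial-repetition case concrete, here is a small plain-Java
sketch that splits addresses at the last '@' and counts how many distinct
local parts and domains a sample contains. The class EmailSplit and its
methods are hypothetical names, and whether storing the two parts in
separate fields is worthwhile is an assumption that would need measuring;
it is not something the javadoc recommends.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EmailSplit {
    // Splits at the last '@'; returns {localPart, domain}.
    // An address without '@' is returned as {address, ""}.
    static String[] split(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            return new String[] { address, "" };
        }
        return new String[] { address.substring(0, at), address.substring(at + 1) };
    }

    // Counts distinct values of one part (0 = local part, 1 = domain).
    static int distinctParts(List<String> addresses, int part) {
        Set<String> seen = new HashSet<>();
        for (String address : addresses) {
            seen.add(split(address)[part]);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "alice@gmail.com", "bob@gmail.com",
                "carol@outlook.com", "dave@gmail.com");

        // Local parts are all different, while the domains repeat heavily.
        System.out.println("distinct local parts: " + distinctParts(sample, 0));
        System.out.println("distinct domains:     " + distinctParts(sample, 1));
    }
}
```

When the domain count is tiny compared to the number of documents, the
domain behaves like a shared value while the local part does not, which is
why a single field type fits such data poorly.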

Best regards,

Arjen
