lucene-dev mailing list archives

From Andrzej Bialecki
Subject Re: Bitwise Operations on Integer Fields in Lucene and Solr Index
Date Thu, 13 May 2010 22:42:35 GMT
On 2010-05-13 23:27, Israel Ekpo wrote:
> Hello Lucene and Solr Community
> I have a custom that I would like to
> contribute to the Lucene and Solr projects.
> So I would need some direction as to how to create an ISSUE or submit a
> patch.
> It looks like there have been changes to the way this is done since the
> latest merge of the two projects (Lucene and Solr).
> Recently, some Solr users have been looking for a way to perform bitwise
> operations between an integer value and some fields in the Index.
> So, I wrote a Solr QParser plugin to do this using a custom Lucene Filter.
> This package makes it possible to filter results returned from a query based
> on the results of a bitwise operation on an integer field in the documents
> returned from the pre-constructed query.


What a coincidence! :) I'm working on something very similar, but my
use case is slightly different: I want to support ranked search based
on the bitwise overlap between the query value and the field value.
That is, each differing bit would reduce the score. This scenario
occurs, e.g., in near-duplicate detection that uses fuzzy signatures
at the document or sentence level.
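
A minimal sketch of what "differing bits reduce the score" could mean
for 32-bit values. The class name, method names, and the linear
scoring formula are assumptions made for illustration; the message
does not specify how the score is actually computed.

```java
// Illustrative only: BitOverlap and its methods are hypothetical names,
// not part of Lucene, Solr, or the code described in this message.
final class BitOverlap {
    private BitOverlap() {}

    /** Number of bit positions in which the two 32-bit values differ. */
    static int differingBits(int query, int field) {
        // XOR leaves a 1 wherever the inputs disagree; bitCount counts them.
        return Integer.bitCount(query ^ field);
    }

    /** Assumed scoring: 1.0 for identical values, minus 1/32 per differing bit. */
    static float score(int query, int field) {
        return (Integer.SIZE - differingBits(query, field)) / (float) Integer.SIZE;
    }
}
```

With this formula, two signatures that differ in only a few bits still
score close to 1.0, which is the behaviour near-duplicate detection
needs.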

I'm going to submit my code early next week; it still needs some
polishing. I have two ways to execute this query, neither of which uses
filters at the moment:

* method 1: during indexing, the bits of the field value are turned
into on/off terms on the same field, and during search a BooleanQuery
is formed from the int query value using the same terms. Scoring is
courtesy of BooleanScorer. This method supports only a single int value
per field.

* method 2 (still incomplete): during indexing, the bits are turned
into terms as before, but this method supports multiple int values per
field: terms that correspond to bitmasks on the same value are put at
the same position. Then a specialized Query / Scorer traverses all 32
posting lists in parallel, moving through all matching docs and scoring
according to how many terms matched at the same position.
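
The two methods above can be sketched in plain Java, without any Lucene
classes. This is only a stand-in: the "bit:value" term format, the
class name, and the method names are assumptions, and simple counting
replaces what BooleanScorer or a specialized Scorer would do over
posting lists.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: BitTerms and its methods are hypothetical names and
// no Lucene API is used here.
final class BitTerms {
    private BitTerms() {}

    /** Encodes each of the 32 bits as an on/off term, e.g. "3:1" when bit 3 is set. */
    static List<String> toTerms(int value) {
        List<String> terms = new ArrayList<>(Integer.SIZE);
        for (int bit = 0; bit < Integer.SIZE; bit++) {
            terms.add(bit + ":" + ((value >>> bit) & 1));
        }
        return terms;
    }

    /** Method 1 stand-in: how many of the query's terms also occur in the field's terms. */
    static int matchingTerms(int queryValue, int fieldValue) {
        List<String> fieldTerms = toTerms(fieldValue);
        int matches = 0;
        for (String term : toTerms(queryValue)) {
            if (fieldTerms.contains(term)) {
                matches++;
            }
        }
        return matches;
    }

    /**
     * Method 2 stand-in: each stored value occupies one position (its array
     * index here); returns the number of matching terms at each position.
     */
    static int[] matchesPerPosition(int queryValue, int[] fieldValues) {
        int[] counts = new int[fieldValues.length];
        for (int pos = 0; pos < fieldValues.length; pos++) {
            counts[pos] = matchingTerms(queryValue, fieldValues[pos]);
        }
        return counts;
    }
}
```

In this sketch the number of matching terms always equals 32 minus the
number of differing bits, so a higher count means a closer bitwise
overlap.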

I wrapped this in a Solr FieldType, and instead of using a custom
QParser plugin I simply implemented FieldType.getFieldQuery().

It would be great to work out a convenient user-level API for this
feature, both the scoring and the non-scoring case.

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
