lucene-solr-user mailing list archives

From Per Steffensen <>
Subject Re: Bloom filter
Date Mon, 04 Aug 2014 14:56:34 GMT
I just finished adding support for persisted ("backed", as I call them) 
bloom filters on top of Guava's BloomFilter. I implemented one kind of 
persisted bloom filter that works on memory-mapped files.
I have changed our Solr code so that it uses such an enhanced Guava 
BloomFilter. Keeping it up to date and using it for quick "does 
definitely not exist" checks will help performance.
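The guarantee being relied on here is that a bloom filter never gives false negatives: a "not present" answer is always correct, while a "present" answer may be a false positive. A minimal hand-rolled sketch of that guarantee in plain Java (this is only an illustration of the idea, not Guava's implementation and not the author's memory-mapped variant; all names are made up):

```java
import java.util.BitSet;

// Minimal bloom filter sketch: a negative mightContain() is always
// correct ("definitely not present"); a positive one may be a false
// positive and must be verified against the real data.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit index from two base hashes
    // (the Kirsch-Mitzenmacher double-hashing trick).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void put(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false; // definitely not present
            }
        }
        return true; // possibly present
    }
}
```

Guava's real `BloomFilter` additionally supports serialization via `writeTo`/`readFrom`, which is the natural hook for persisting it the way the post describes.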

We also do duplicate checking, because we might get the "same" data 
from our external provider numerous times. We do it using the unique-id 
feature in Solr, making sure that (in practice) two documents have the 
same id if and only if they are "the same". We encode most of the info 
on the document in its id - including hashes of the textual fields. Works 
like a charm. It is exactly in this case that we want to improve performance: 
most of the time a document does not already exist when we do this 
duplicate check (using the unique-id feature), but it takes a 
relatively long time to verify that, because you have to visit the index. 
With a bloom filter on the id we can get a quick "document with this id 
does not exist" answer.
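The fast path described above can be sketched as follows, with a `HashSet` standing in for the (expensive) index lookup and a single-hash `BitSet` standing in for the persisted bloom filter; class and method names are assumptions for illustration, not the author's actual code:

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch: consult a bloom filter on the document id first, and only
// pay for the real index lookup when the filter says the id might exist.
public class DedupIndexer {
    private static final int BITS = 1 << 20;
    private final BitSet bloom = new BitSet(BITS);     // stands in for the persisted filter
    private final Set<String> index = new HashSet<>(); // stands in for the Solr index

    private int bit(String id) {
        return Math.floorMod(id.hashCode(), BITS);
    }

    /** Returns true if the document was new and got indexed. */
    public boolean indexIfAbsent(String id) {
        if (!bloom.get(bit(id))) {
            // "Definitely not present": skip the index visit entirely.
            index.add(id);
            bloom.set(bit(id));
            return true;
        }
        // "Might be present": fall back to the authoritative index lookup.
        if (index.contains(id)) {
            return false; // genuine duplicate, drop it
        }
        index.add(id);
        bloom.set(bit(id));
        return true;
    }
}
```

Since most incoming documents are new, most calls take the cheap branch, which is exactly where the post expects the performance win.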

Regards, Per Steffensen

On 03/08/14 03:58, Umesh Prasad wrote:
> +1 to Guava's BloomFilter implementation.
> You can actually hook into the UpdateProcessor chain and put the logic of
> updating/checking the bloom filter there.
> We had a somewhat similar use case. We were using DIH, and it was possible
> that the same Solr input document (meaning the same content) would come in
> lots of times, which was leading to a lot of unnecessary updates to the
> index. I introduced a DuplicateDetector in the update processor chain which
> kept a map of unique ID --> Solr doc hash code and would drop the document
> if it was a duplicate.
> There is a nice video on other usages of the update chain
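The core of the duplicate detector described in the quoted reply - a map from unique id to a hash of the document's content, dropping a resubmission whose content is unchanged - can be sketched in plain Java. In Solr this logic would live in an `UpdateRequestProcessor` subclass's `processAdd()`; the class and method names here are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the duplicate-detection logic: remember a content hash per
// unique id and drop re-submissions whose content is unchanged.
public class DuplicateDetector {
    private final Map<String, Integer> seen = new HashMap<>();

    /** Returns true if the document should be indexed, false if dropped. */
    public boolean shouldIndex(String id, String content) {
        int hash = content.hashCode();
        Integer previous = seen.put(id, hash);
        // Drop only when this id was seen before with identical content.
        return previous == null || previous != hash;
    }
}
```

A production version would use a stronger content hash than `String.hashCode()` and would need to bound the map's memory, but the drop/index decision is the same.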
