commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Claude Warren (Jira)" <>
Subject [jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
Date Fri, 01 Nov 2019 10:10:00 GMT


Claude Warren commented on COLLECTIONS-728:

=== From the conversation above ===

 Why doesn't Shape keep a reference to Hasher? That would avoid having to pass both to e.g.
method contains in BloomFilter (and would make is unnecessary to check their consistency).

 (see the response to how are StaticHasher and DynamicHasher different for some extra background).
 The contains method with the Hasher was requested because one of the other Apache projects
was interested in checking the presence of the Bloom filter without building a complete Bloom
filter. They have the hasher and they know the shape it should be. So the method actually
needs the Hasher for the data.

Now, they can build the DynamicHasher and pass that, or they could build a StaticHasher (which
does know it's shape), or they could implement another Hasher type.

Placing the Hasher in the Shape would mean that the shape will always return the same data
when what we are doing is trying to send different data with the same shape.

Still it doesn't look right, API-wise; but again I'm lacking a clear view of all the requirements
to provide an alternative. 

=== end of quote ===

The methods that have both the Hasher and the Shape are really passing decomposed Bloom filters.
 The Hasher contains the hash values and the Shape identifies shape of the decomposed Bloom
filter.   True the Hasher could be limited to a StaticHasher and the Shape would not be needed
as the Static Hahser contains the Shape it was built with, but in the general case a Hasher
does not have a Shape. 

The reason for the separation is the Hasher applies a hash function to data and returns int
The shape indicates to the Hasher how many times it should apply the hashing algorithm (k
in the maths) to each data item being hashes (item = the X in DynamicHasher.with(X)).  The
Shape also give the upper limit for the values the hasher returns on the particular getBits()
call.  The upper limit is "m" in the maths.

So Hasher + Shape = an iteration of integers that are the bits turned on the Bloom filter.

A single hasher can generated Bloom filters of an almost infinite number of Shapes.

--- snip --

=== start of quote === 

>From KISS and DRY POVs, I disagree with both statements: The factory could be immutable
and readily provide all the useful algorithms (see e.g. RandomSource).

=== end of quote ===

There are two issues with the idea that we can provide all the usefull algorithms.

1) We will always be behind when it comes to implementing new algorithms using new hash functions.
 It is better to allow the users in the field to add algorithms as they see fit.  I remember
an instructor one said the the only valid limits in computer science are zero, one and many.
 Providing an implementation with some but not allowing for addition of others is very limiting.

2) There are known cases (Cassandra comes to mind) where an implementation of a Hashing algorithm
was incorrectly executed and released into the wild.  If such an application wanted to use
this library they would be unable to add their "broken" hashing to the factory would be an

So why not provide a factory that " provide all the useful algorithms" and has the convenience
methods to add hash functions?

Once we add hash functions it seems a short step to resetting the factory to default.  At
a minimum this would be needed for testing.

> BloomFilter contribution
> ------------------------
>                 Key: COLLECTIONS-728
>                 URL:
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments:,,,
> Contribution of BloomFilter library comprising base implementation and gated collections.

This message was sent by Atlassian Jira

View raw message