commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Claude Warren (Jira)" <>
Subject [jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
Date Wed, 30 Oct 2019 12:10:00 GMT


Claude Warren commented on COLLECTIONS-728:

I originally answered this in an email, but it didn't get posted back here so here are my

  * Nit: There are formatting issues which I mentioned in other reviews.

Yep, this was just a new look at a possible solution.  I expect there to be a fair amount
of cleanup to be done.

    * Why an _abstract_ class instead of an _interface_ (and _default_ methods)?

Interface does not support protected methods.  In addition using _default_, which is described
in the literature as a mechanism for extending existing interfaces, to build a new interface
doesn't seem right.  Also, are there any cases of Bloom filters being anything more than a
Bloom filter?

    * Why pass a {{Hasher}} to the {{Shape}} constructor when only its "name" is used?

Passing the hasher ensures that the name is actually bound.  It could be just the name.  By
passing the Hasher there is some check on the validity of the name, but that is not necessary.

    * Why are methods _final_?  It's redundant for _private_ ones, and strange for those name
{{estimate...}} (one could imagine that subclasses could have their own way).

I missed the private final, you are correct the final should be removed.

The others are final as they are based on definitions and mathematical proofs from the literature.
 The final could be removed if desired. 


    * Why doesn't {{Shape}} keep a reference to {{Hasher}}?  That would avoid having to pass
both to e.g. method {{contains}} in {{BloomFilter}} (and would make is unnecessary to check
their consistency).

(see the response to how are StaticHasher and DynamicHasher different for some extra background).
The contains method with the Hasher was requested because one of the other Apache projects
was interested in checking the presence of the Bloom filter without building a complete Bloom
filter.  They have the hasher and they know the shape it should be.  So the method actually
needs the Hasher for the data.

Now, they can build the DynamicHasher and pass that, or they could build a StaticHasher (which
does know it's shape), or they could implement another Hasher type. 

Placing the Hasher in the Shape would mean that the shape will always return the same data
when what we are doing is trying to send different data with the same shape.


    * I don't get the {{reset()}} method in {{HasherFactory}}.  At first sight, this class
should be immutable.

Because hashers can be added to the Factory, there needs to be a way to reset them.  This
is the similar to the problem in Apache Jena with  the datatypes TypeMapper where the mapping
between external objects and their datatype names is maintained.  Items can be added but experience
shows they also need to be reset.  Particularly when testing.

    * In {{DynamicHasher}}, it seems that the "lock" construct could be advantageously replaced
with a builder (i.e. the returned {{Hasher}} instance would only offer the {{getBits()}} API.

Yes, that is a valid and reasonable construct.

    * It seems strange to define the {{Hasher}} interface inside {{HasherFactory}} (IMO that
should be the other way around).

Makes no difference to me, Hasher and HasherFactory could be defined on their own as well.

    * What is the difference between {{StaticHasher}} and {{DynamicHasher}}?

First the similarity:  Both return integers that are the indexes of bits to be turned on based
on the shape argument to getBits( Shape ).

DynamicHasher uses a list of ByteBuffers to generate the indexes on the fly.  Can generate
for any Shape.

StaticHasher stores a list of indexes that were generated for a specific Shape.  Can only
provide indexes for the original Shape and does it by returning an iterator over a stored
list of integers.  (Much faster)

    *  I'm afraid that the review is not exhaustive; again, I have a hard time figuring out
the fundamental (design) structure from convenience features. 

The fundamental methods that the BloomFilter must have (using the new contains() rather than
matches() method) are contains() and merge().
However for disparate implementations to be able to execute the contains() and merge() methods
some common representation of the underlying bit storage must be made available.  This implementation
has 2: getBits() [ long buffer encoded bit pattern ], and getHasher() [ iterator over t the
indexes of the bits that are turned on ].  These two solutions allow for performance gains
to be made depending on the size of the Bloom filter (number of bits possible - m) and the
number of bits that are turned on in the filter (hammingValue or cardinality).

Fundamental checking methods are the methods that ensure the shapes of the filter being compared
or merged  are the same.  Thus the getShape() and the protected verifyX() methods.

All other methods are "convenience" methods, though are generally expected in any developed
Bloom filter implementation.  Implementations are provided based on the fundamental methods
getBits() and/or getHasher().

 (new comment -- not in response to the above.)

Another reviewer suggested getting rid of matches() and only have contains() as it is much
more descriptive and when you have both Bloom filters (as required for matches)  it is possible
to call contains() instead (e.g. A.matches(B) is the same as B.contains( A )).  This sounds
reasonable to me.

> BloomFilter contribution
> ------------------------
>                 Key: COLLECTIONS-728
>                 URL:
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments:,,,
> Contribution of BloomFilter library comprising base implementation and gated collections.

This message was sent by Atlassian Jira

View raw message