commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gary Gregory <>
Subject Re: [collections] BloomFilter package architecture discussion
Date Mon, 14 Oct 2019 12:37:09 GMT
Just a quick note that, on my end, I might be able to get back to looking
at this PR soon-ish but it might not be until the week-end.


On Mon, Oct 14, 2019 at 8:12 AM Claude Warren <> wrote:

> Greetings,
> I feel like I have been beating my head against a wall trying to get this
> contribution accepted; this is not a complaint about process or
> personalities, just a statement of how I feel. I realize that I have made
> progress but I also realize that there is a long way to go. Furthermore,
> the other potential users and supporters have not responded to any
> communication about this proposal so I am floundering on that front too.
> What I want to do is take a step back. Let’s remove the collection packages
> from consideration and work out the base package first. With that in mind I
> would like to present reasoning behind the classes as they now stand and
> suggest a way forward to get the core classes in before adding the
> collections.
> As discussed earlier Bloom filters are a concept. In most peoples mind’s
> they are bit vectors with specific named functions (e.g. match => A and B =
> A, hammingValue => cardinality(this)). This idea of the filter as a bit
> vector led me to using a BitSet. Originally the BloomFilter was
> serializable but valid arguments against that and toward making the
> internal representation visible enought to serialize that lead to making
> the BitSet public. In addition to the normal Bloom filter methods
> (hammingValue, match, etc.) there are some methods (log, cosine distance,
> jaccard distance, etc.) that are used by researchers and other attempting
> to do some work with indexing and such for large collections of filters.
> Finally there is the question of “normal” usage. In his paper on
> “Compressed Bloom Filters”[1] (an implementation we do not have yet)
> Michael Mitzenmacher states “the Bloom filter plays a dual role. It is both
> a data structure being used at the proxies, and a message being passed
> between them”. He is discussing the issue of using bloom filters in a
> distributed cache where the proxies distributed their bloom filter so that
> all the proxies know which proxy might have the document. So the Bloom
> filter needs both an internal representation and a mechanism to pass the
> data between systems that use it. It seems obvious that the BitSet works as
> the mechanism to pass the data between systems, though an immutable version
> would be preferred.
> So the Bloom filter needs an external reference implementation of its
> state, so that it can communicate said state with other Bloom filters. The
> quesiton of what is necessary for transmission of that state is still open.
> I think we can agree a mechanism to identify which bits are on is requried
> as part of the state. I have seen discussions that indicate the hashing
> algorithm and other configuration information should also be included, but
> I believe this is dependent upon where the filter is being sent. All Bloom
> filters within a system, by necessity, have to use the same hash and
> configuration parameters. Only when sending outside the system does the
> other data need to be made available. In my mind this means those data
> structures have to be availabe to be transmitted along with the Bloom
> filter or filters.
> So now to the code in the current contribution. The argument that
> BloomFilter should be a class not an interface is a good one. However, as
> the internal representation of the bit structure can vary wildly, I think
> it should be an abstract class. My reasoning is as follows: Assuming we
> have an external reference implementation of state many/all of the methods
> could be implemented against that definition. This is what the current
> implementation attempts to do, but not successfully. Any implementation of
> the abstract class could develop an internal representation of state that
> is more efficient for whatever the use case is (e.g. counting Bloom
> filters, compressed Bloom filters, attenuated Bloom filters, etc) and
> override the methods that are more efficiently answered within its internal
> structure. In addition, the abstract Bloom filter might implement the “data
> read” methods from BitSet directly. This would allow more efficient
> implementation of several methods that currently depend upon the creation
> of a BitSet and the manipulation of same.
> I think this change yields the following:
> 1. Creation of an external reference implementation of state.
> 2. Conversion of BloomFilter for interface to AbstractBloomFilter class
> 3. Merge BloomFilterFunctions into AbstractBloomFilter
> 4. Rename StandardBloomFilter to BloomFilter (should we do this?)
> 5. Modify (Standard)BloomFilter and CountingBloomFilter to extends the
> AbstractBloomFilter class.
> On the topic of an external reference implementation of state, could this
> be an interface that AbstractBloomFilter implements? Might it comprise the
> “read data” methods from the BitSet class?
> [1]
> --
> I like: Like Like - The likeliest place on the web
> <>
> LinkedIn:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message