commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject [collections] BloomFilter package architecture discussion
Date Mon, 14 Oct 2019 12:12:21 GMT
Greetings,


I feel like I have been beating my head against a wall trying to get this
contribution accepted; this is not a complaint about process or
personalities, just a statement of how I feel. I realize that I have made
progress but I also realize that there is a long way to go. Furthermore,
the other potential users and supporters have not responded to any
communication about this proposal so I am floundering on that front too.
What I want to do is take a step back. Let’s remove the collection packages
from consideration and work out the base package first. With that in mind I
would like to present reasoning behind the classes as they now stand and
suggest a way forward to get the core classes in before adding the
collections.


As discussed earlier Bloom filters are a concept. In most peoples mind’s
they are bit vectors with specific named functions (e.g. match => A and B =
A, hammingValue => cardinality(this)). This idea of the filter as a bit
vector led me to using a BitSet. Originally the BloomFilter was
serializable but valid arguments against that and toward making the
internal representation visible enought to serialize that lead to making
the BitSet public. In addition to the normal Bloom filter methods
(hammingValue, match, etc.) there are some methods (log, cosine distance,
jaccard distance, etc.) that are used by researchers and other attempting
to do some work with indexing and such for large collections of filters.


Finally there is the question of “normal” usage. In his paper on
“Compressed Bloom Filters”[1] (an implementation we do not have yet)
Michael Mitzenmacher states “the Bloom filter plays a dual role. It is both
a data structure being used at the proxies, and a message being passed
between them”. He is discussing the issue of using bloom filters in a
distributed cache where the proxies distributed their bloom filter so that
all the proxies know which proxy might have the document. So the Bloom
filter needs both an internal representation and a mechanism to pass the
data between systems that use it. It seems obvious that the BitSet works as
the mechanism to pass the data between systems, though an immutable version
would be preferred.


So the Bloom filter needs an external reference implementation of its
state, so that it can communicate said state with other Bloom filters. The
quesiton of what is necessary for transmission of that state is still open.
I think we can agree a mechanism to identify which bits are on is requried
as part of the state. I have seen discussions that indicate the hashing
algorithm and other configuration information should also be included, but
I believe this is dependent upon where the filter is being sent. All Bloom
filters within a system, by necessity, have to use the same hash and
configuration parameters. Only when sending outside the system does the
other data need to be made available. In my mind this means those data
structures have to be availabe to be transmitted along with the Bloom
filter or filters.


So now to the code in the current contribution. The argument that
BloomFilter should be a class not an interface is a good one. However, as
the internal representation of the bit structure can vary wildly, I think
it should be an abstract class. My reasoning is as follows: Assuming we
have an external reference implementation of state many/all of the methods
could be implemented against that definition. This is what the current
implementation attempts to do, but not successfully. Any implementation of
the abstract class could develop an internal representation of state that
is more efficient for whatever the use case is (e.g. counting Bloom
filters, compressed Bloom filters, attenuated Bloom filters, etc) and
override the methods that are more efficiently answered within its internal
structure. In addition, the abstract Bloom filter might implement the “data
read” methods from BitSet directly. This would allow more efficient
implementation of several methods that currently depend upon the creation
of a BitSet and the manipulation of same.


I think this change yields the following:


1. Creation of an external reference implementation of state.

2. Conversion of BloomFilter for interface to AbstractBloomFilter class

3. Merge BloomFilterFunctions into AbstractBloomFilter

4. Rename StandardBloomFilter to BloomFilter (should we do this?)

5. Modify (Standard)BloomFilter and CountingBloomFilter to extends the
AbstractBloomFilter class.


On the topic of an external reference implementation of state, could this
be an interface that AbstractBloomFilter implements? Might it comprise the
“read data” methods from the BitSet class?


[1] https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message