commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: [collections] BloomFilter package architecture discussion
Date Tue, 15 Oct 2019 18:05:37 GMT
On Tue, Oct 15, 2019 at 1:46 AM Gilles Sadowski <gilleseran@gmail.com>
wrote:

> Hello.
>
>
>
> > Furthermore,
> > the other potential users and supporters have not responded to any
> > communication about this proposal so I am floundering on that front too.
>
> Who are they?
>

Developers I have worked with or know of that have expertice utilizing and
analyzing Bloom filters in various projects and papers. Apache developers
that are considering adding Bloom filters to their projects.


>
> > What I want to do is take a step back. Let’s remove the collection
> packages
> > from consideration and work out the base package first.
>
> +1
>
> > With that in mind I
> > would like to present reasoning behind the classes as they now stand and
> > suggest a way forward to get the core classes in before adding the
> > collections.
>
> For sure, I appreciate your willingness to provide more explanations. :-)
>
> > As discussed earlier Bloom filters are a concept. In most peoples mind’s
> > they are bit vectors with specific named functions (e.g. match => A and
> B =
> > A, hammingValue => cardinality(this)). This idea of the filter as a bit
> > vector led me to using a BitSet. Originally the BloomFilter was
> > serializable but valid arguments against that and toward making the
> > internal representation visible enought to serialize that lead to making
> > the BitSet public.
>
> Maybe I'm missing your point, but I don't see a direct connection
> between persistence of a "BloomFillter" instance and the visibility
> of "BitSet" (if that is used to represent its internal state).
>
> Realistically, when we “persist” a Bloom filter we are preserving the
internal state. If the Bloom filter itself is not serializable then the
internal state has to be preserved somehow. You made a comment on Sep 23,
2019 when we were discussing the serialization on the “New Sub-project
Proposal.” topic in which you said “At first sight, I'd say that
serialization is out-of-scope (we should let application developers deal
with that using the available accessors).” Perhaps I have misunderstood but
when I read that a light went on. What has to happen is that enough of the
internal state has to be exposed for the developer to preserve that state
and be able to reinstantiate it later. As the state is really an bit array
I figured the most logical thing would be to expose the BitSet.
Alternatively we could expose a ByteBuffer or any of a number of structures
that can represent a bit array. However, the BitSet is, semantically
speaking, what we are exposing.


> > In addition to the normal Bloom filter methods
> > (hammingValue, match, etc.) there are some methods (log, cosine distance,
> > jaccard distance, etc.) that are used by researchers and other attempting
> > to do some work with indexing and such for large collections of filters.
>
> IIUC, you indeed distinguish between core functionality (i.e. "normal
> Bloom filter methods") and additional support for (?)...
> If so, "core" should indeed be implemented first, with maximum
> encapsulation (even if it would seem to prevent "research" usage).
>
> Indeed I distinguish. I distinguish to align various functionality with
external documentation. For example in Bloom’s paper he does not talk about
Hamming values a term and concept that is found in every Bloom filter
implementation I have seen and which is frequent in the literature. So a
“Strict” implementation of a Bloom filter might only have a single method
“match()” and still be considered an implementation of a Bloom filter, but
it would not be of much use in modern systems. Realistically, there are a
number of methods that are commonly found in Bloom filter implementations.
I believe that at a minimum they are “match()”, “merge()” and
“hammingValue()”.



> > Finally there is the question of “normal” usage. In his paper on
> > “Compressed Bloom Filters”[1] (an implementation we do not have yet)
> > Michael Mitzenmacher states “the Bloom filter plays a dual role. It is
> both
> > a data structure being used at the proxies, and a message being passed
> > between them”. He is discussing the issue of using bloom filters in a
> > distributed cache where the proxies distributed their bloom filter so
> that
> > all the proxies know which proxy might have the document. So the Bloom
> > filter needs both an internal representation and a mechanism to pass the
> > data between systems that use it. It seems obvious that the BitSet works
> as
> > the mechanism to pass the data between systems, though an immutable
> version
> > would be preferred.
>
> So it seems we can have a Bloom filter state represented by a
> "private" and "final" instance of "BitSet".  Is there any reason that
> the implementation in [Collections] would ever need to use several
> representations of its state?
> If not, I'd think that "BloomFilter" can be a concrete class (with, for
> now, a "BitSet" as its state).
>
> The counter examples that come to mind is not within a single Bloom filter
type. A counting bloom filter tracks the number of times a bit has been set
so that it can do deletions (something standard Bloom filters have a hard
time with). The implementation in the code base simply extends the
StandardBloomFilter implementation and adds a sparse array of Integers as
the counts. However, the counting Bloom filter does not need the BitSet
from StandardBloomFilter. It can generate one on demand. The counting Bloom
filter needs to be able to determine if an AbstractBloomFilter matches its
pattern. It can do this internally without using a BitSet by scanning the
sparse array (though a bitset might prove faster). Finally the counting
Bloom filter can be merged into another bloom filter just like any other
AbstractBloomFilter can. It does that by basically turning on the bits in
the AbstractBloomFilter that are indexes for the entries in the internal
sparse array. There are several other examples such as layered Bloom
filters (https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
and attenuated Bloom filters (
https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters) where
the internal structure deviates from the BitSet model but that still need
to present themselves as AbstractBloomFilters do. I think that Abstract
class is the correct pattern here.


> > So the Bloom filter needs an external reference implementation of its
> > state, so that it can communicate said state with other Bloom filters.
> The
> > quesiton of what is necessary for transmission of that state is still
> open.
> > I think we can agree a mechanism to identify which bits are on is
> requried
> > as part of the state. I have seen discussions that indicate the hashing
> > algorithm and other configuration information should also be included,
> but
> > I believe this is dependent upon where the filter is being sent. All
> Bloom
> > filters within a system, by necessity, have to use the same hash and
> > configuration parameters.
>
> The last statement is not obvious to me, but it surely looks like
> a simplifying assumption.
> What are examples of "system" in that sentence?
>

In my statement above a “system” is a series of processes that accept Bloom
filters as arguments. For example a collections of 10K buckets. If you
created Bloom filter on each bucket, gathered the buckets in to groups of
100 and build Bloom filters for each group you could quickly tell if and
where an object might be by checking the group bloom filters. But every
Bloom filter in this system must have the same “shape” (The same hash,
number of bits, and number of functions) for the comparisons to work. All
of those are in one “system”. However, a single application may have
multiple systems.



>
> > Only when sending outside the system does the
> > other data need to be made available. In my mind this means those data
> > structures have to be availabe to be transmitted along with the Bloom
> > filter or filters.
>
> Even if so, this is about serialization/persistence; thus, not a "core"
> issue.
> If the above "system" is the application level, all "BloomFilter" instances
> should be able to pass messages between them using instance methods.
>
>
I believe this is a “core” issue. I used an example of a proxy system in my
original post. There is no requirement that each of those proxies be on a
single machine (logical, virtual or physical), in fact it could be argued
that they probably would not be on the same machine. So passing Bloom
filters between machines (I am reticent to say “system” here) is one of the
core requirements to making Bloom filter usage efficient in modern
distributed architectures.


> > So now to the code in the current contribution. The argument that
> > BloomFilter should be a class not an interface is a good one. However, as
> > the internal representation of the bit structure can vary wildly,
>
> Why should it?
>

See response at concreate/abstract discussion above.


>
> > I think
> > it should be an abstract class. My reasoning is as follows: Assuming we
> > have an external reference implementation of state many/all of the
> methods
> > could be implemented against that definition.
>
> I don't follow: OO is about behaviour(s), "state" being an implementation
> detail.
>
> This goes to the discussion of the BloomFilterFunctions and some methods
in the BloomFilterConfiguration class that estimate sizes. Those functions
can be implemented against the “external reference implementation of state”
which could be a BitSet. They all operate on the bit patterns using the
logical “or”, “and”, “xor” functions as well as cardinality, and
nextSetBit.


> > This is what the current
> > implementation attempts to do, but not successfully. Any implementation
> of
> > the abstract class could develop an internal representation of state that
> > is more efficient for whatever the use case is (e.g. counting Bloom
> > filters, compressed Bloom filters, attenuated Bloom filters, etc)
>
> Are these variants impossible to represent with a "BitSet"?
>
> I think in all cases where the specific implementation of BloomFilter
needs to present itself as a BloomFilter, yes BitSet can represent that.
But is cases where the internal construct is much more complex it may not
make sense to try to keep an internal BitSet in sync with other structures
but rather generate the BitSet if and when it is necessary.


> > and
> > override the methods that are more efficiently answered within its
> internal
> > structure.
>
> Unless I'm missing the point, this then seems to mandate an
> interface (rather than an abstract or concrete class), contrary
> to what was said previously...
>
>
I think was was missing was the discussion of the Functions as noted in my
response to your comments about OO behaviour. I am sorry that I did not
make that clear in the original post, it probably would have reduced the
confusion significantly. Just to be clear. I think this is an Abstract
class that implements an “external reference implementation of state”
interface. That interface will then allow the abstract Bloom filter class
to implement the functions that are in the BloomFilterFunctions class (thus
removing the utility class). In addition they will permit the
BloomFilterConfigurtion class to continue to implement the estimated size
methods agains the AbstractBloomFilter class. I think the implementation of
the “external reference implementation of state” ends up being implemented
by the concrete Bloom filter classes. I don’t like the term “external
reference implementation of state” and wish I had a good interface name to
apply to it, but I started with it earlier and am sticking with it here
until it is either discarded or named.


> > In addition, the abstract Bloom filter might implement the “data
> > read” methods from BitSet directly. This would allow more efficient
> > implementation of several methods that currently depend upon the creation
> > of a BitSet and the manipulation of same.
>
> Sure; if any and all variants of a "BloomFilter" implementation can
> be created from a "BitSet", then such a factory method could be
> part of the "Builder" interface:
>
> public interface BloomFilter {
>     // Core methods...
>
>     public interface Builder {
>         public Builder from(BitSet s); // Factory method.
>         public BloomFilter build();
>         // Other builder methods.
>     }
> }
>
> Indeed, this is a valid construct. Currently the constructors take a
ProtoBloomFilter and BloomFilterConfiguration, or a BitSet. If the
“external reference implementation of state” is used then I expect to
change the second constructor to take an AbstractBloomFilter instead. It is
the ProtoBloomFilter that has the Builder. The ProtoBloomFilter allows us
to build the hashes but not apply them to build the specific “shape”
(definition of “shape” noted above). Thus one ProtoBloomFilter can be the
progenitor of Bloom filters in multiple “systems” (defintion of “system”
noted above). This also means that the 10K bucket example I noted above can
become much more performant because we can segregate the bucket and group
Bloom filters into different “Systems” so that we get better collision
avoidance at the group level than we would otherwise. (I can provide data
and example if you wish.)


> >
> >
> > I think this change yields the following:
> >
> >
> > 1. Creation of an external reference implementation of state.
> >
> > 2. Conversion of BloomFilter for interface to AbstractBloomFilter class
> >
> > 3. Merge BloomFilterFunctions into AbstractBloomFilter
> >
> > 4. Rename StandardBloomFilter to BloomFilter (should we do this?)
> >
> > 5. Modify (Standard)BloomFilter and CountingBloomFilter to extends the
> > AbstractBloomFilter class.
>
> It seems that somehow we diverge on the emphasis given to the
> "state".  I think that the questions which I raised above should be
> answered before jumping to those conclusions.
>

I agree and hope this response has clarified and answered.


>
> > On the topic of an external reference implementation of state, could this
> > be an interface that AbstractBloomFilter implements? Might it comprise
> the
> > “read data” methods from the BitSet class?
>
> IIUC, that will trivially be the case if "BitSet" can be the exchange
> format. And how to actually perform serialization is a user's issue
> (i.e. whether to rely on "BitSet" being "Serializable", or to implement
> a proxy...).
>
> To be honest, I had not worked through the “external reference
implementation of state” but initially thought it might be a BitSet. After
the above, perhaps not. Again, I haven’t quite thought it all through. If
“external reference implementation of state” really is an interface then it
must be possible to construct a Bloom filter from an instance of it.
Otherwise, developers would not be able to serialize the data, deserialize
into an instance of the interface and recreate the filter. On the other
hand, the more complext Bloom filters might require other data (e.g. a
sparse list of counts) to fully reconstruct them. These later cases are
probably best considered special cases for each filter type and addressed
within the specific implementation.


> Regards,
> Gilles
>
> >
> > [1] https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
> >
>

As an aside I was reading about "The Humpty Dumpty Principle" and came
across
https://definitionsinsemantics.blogspot.com/2012/03/humpty-dumpty-principle-in-definitions.html.
The article makes me wish I have never used the term “external reference
implementation of state”

Thanks for your patience and for taking the time to read through all of
this,
Claude
-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message