commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: [collections] BloomFilter package architecture discussion
Date Thu, 17 Oct 2019 19:45:48 GMT
On Wed, Oct 16, 2019 at 2:08 AM Gilles Sadowski <gilleseran@gmail.com>
wrote:

> Hi.
>
> 2019-10-15 20:05 UTC+02:00, Claude Warren <claude@xenei.com>:
> > On Tue, Oct 15, 2019 at 1:46 AM Gilles Sadowski <gilleseran@gmail.com>
> > wrote:
> >
> >> Hello.
> >>
> >>
> >>
> >> > Furthermore,
> >> > the other potential users and supporters have not responded to any
> >> > communication about this proposal so I am floundering on that front
> >> > too.
> >>
> >> Who are they?
> >>
> >
> > Developers I have worked with or know of that have expertice utilizing
> and
> > analyzing Bloom filters in various projects and papers. Apache developers
> > that are considering adding Bloom filters to their projects.
>
> For sure, it would clear the path (to a design) if more people
> would lay out (usage) requirements.
>
> >
> >>
> >> > What I want to do is take a step back. Let’s remove the collection
> >> packages
> >> > from consideration and work out the base package first.
> >>
> >> +1
> >>
> >> > With that in mind I
> >> > would like to present reasoning behind the classes as they now stand
> >> > and
> >> > suggest a way forward to get the core classes in before adding the
> >> > collections.
> >>
> >> For sure, I appreciate your willingness to provide more explanations.
> :-)
> >>
> >> > As discussed earlier Bloom filters are a concept. In most peoples
> >> > mind’s
> >> > they are bit vectors with specific named functions (e.g. match => A
> and
> >> B =
> >> > A, hammingValue => cardinality(this)). This idea of the filter as a
> bit
> >> > vector led me to using a BitSet. Originally the BloomFilter was
> >> > serializable but valid arguments against that and toward making the
> >> > internal representation visible enought to serialize that lead to
> >> > making
> >> > the BitSet public.
> >>
> >> Maybe I'm missing your point, but I don't see a direct connection
> >> between persistence of a "BloomFillter" instance and the visibility
> >> of "BitSet" (if that is used to represent its internal state).
> >>
> >> Realistically, when we “persist” a Bloom filter we are preserving the
> > internal state. If the Bloom filter itself is not serializable then the
> > internal state has to be preserved somehow. You made a comment on Sep 23,
> > 2019 when we were discussing the serialization on the “New Sub-project
> > Proposal.” topic in which you said “At first sight, I'd say that
> > serialization is out-of-scope (we should let application developers deal
> > with that using the available accessors).” Perhaps I have misunderstood
> but
> > when I read that a light went on. What has to happen is that enough of
> the
> > internal state has to be exposed for the developer to preserve that state
> > and be able to reinstantiate it later. As the state is really an bit
> array
> > I figured the most logical thing would be to expose the BitSet.
> > Alternatively we could expose a ByteBuffer or any of a number of
> structures
> > that can represent a bit array. However, the BitSet is, semantically
> > speaking, what we are exposing.
>
> Maybe I was not clear enough: I'm not saying that we should prefer
> some representation (of the state) over another; only that the how
> the state is represented externally need not be the same as the internal
> representation.
>
> But if the state is fully encapsulated by a "BitSet", then internal and
> external can be the same, making easier both for developers and for
> users.  So let's have indeed a "getBitSet()" that returns a copy of the
> (private final) instance field (of type "BitSet") used to represent the
> state of a "BloomFilter".
>
>
I agree with most of this statement. However, I think the use of BitSet for
the internal representation is questionable. For the simple cases it works
and there is a whole class of Bloom filters where this would be acceptable.
However for filters that have some concept of a count (e.g. Counting Bloom
Filter, Stable Bloom Filters) it doesn’t make sense to force the use of the
BitSet internally. In addition, there are researchers that do not use
BitSet because it is too slow or too bloated. The exaples are
https://github.com/lemire/javaewah, and
https://github.com/RoaringBitmap/RoaringBitmap which are used in Apache
projects. The Dilemma her is to require BitSet or simply require BitSet
functionality. I think we should go with the later. I think we should
define an interface that represents the BitSet functionality and then
provide an imlementation that simply wraps the BitSet. We can use the
Implementation in our classes and use the interface where possible but it
leaves open the possibility of switching in a version of RoaringBitmap or
similar without changing the external API. It is this interface that is the
“external reference implementation of state”. I think I would like to call
this interface BitSetI and will refer to it this way later. I think it
should contain the following methods as defined in BitSet: and(), andNot(),
cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(), xor().


> >> > In addition to the normal Bloom filter methods
> >> > (hammingValue, match, etc.) there are some methods (log, cosine
> >> > distance,
> >> > jaccard distance, etc.) that are used by researchers and other
> >> > attempting
> >> > to do some work with indexing and such for large collections of
> >> > filters.
> >>
> >> IIUC, you indeed distinguish between core functionality (i.e. "normal
> >> Bloom filter methods") and additional support for (?)...
> >> If so, "core" should indeed be implemented first, with maximum
> >> encapsulation (even if it would seem to prevent "research" usage).
> >>
> >> Indeed I distinguish. I distinguish to align various functionality with
> > external documentation. For example in Bloom’s paper he does not talk
> about
> > Hamming values a term and concept that is found in every Bloom filter
> > implementation I have seen and which is frequent in the literature. So a
> > “Strict” implementation of a Bloom filter might only have a single method
> > “match()” and still be considered an implementation of a Bloom filter,
> but
> > it would not be of much use in modern systems. Realistically, there are a
> > number of methods that are commonly found in Bloom filter
> implementations.
> > I believe that at a minimum they are “match()”, “merge()” and
> > “hammingValue()”.
>
> No problem with that: API should thus contain those instance methods.
> That's the kind of information we need from users of the functionality.
>
> >> > Finally there is the question of “normal” usage. In his paper on
> >> > “Compressed Bloom Filters”[1] (an implementation we do not have yet)
> >> > Michael Mitzenmacher states “the Bloom filter plays a dual role. It is
> >> both
> >> > a data structure being used at the proxies, and a message being passed
> >> > between them”. He is discussing the issue of using bloom filters in a
> >> > distributed cache where the proxies distributed their bloom filter so
> >> that
> >> > all the proxies know which proxy might have the document. So the Bloom
> >> > filter needs both an internal representation and a mechanism to pass
> >> > the
> >> > data between systems that use it. It seems obvious that the BitSet
> >> > works
> >> as
> >> > the mechanism to pass the data between systems, though an immutable
> >> version
> >> > would be preferred.
> >>
> >> So it seems we can have a Bloom filter state represented by a
> >> "private" and "final" instance of "BitSet".  Is there any reason that
> >> the implementation in [Collections] would ever need to use several
> >> representations of its state?
> >> If not, I'd think that "BloomFilter" can be a concrete class (with, for
> >> now, a "BitSet" as its state).
> >>
> >> The counter examples that come to mind is not within a single Bloom
> >> filter
> > type. A counting bloom filter tracks the number of times a bit has been
> set
> > so that it can do deletions (something standard Bloom filters have a hard
> > time with). The implementation in the code base simply extends the
> > StandardBloomFilter implementation and adds a sparse array of Integers as
> > the counts.
>
> So, in my understanding, it is an extension of the above "BitSet"
> state, and thus a valid case for inheriting from "BloomFilter" (as
> a concrete, not "abstract", class.
>
> > However, the counting Bloom filter does not need the BitSet
> > from StandardBloomFilter. It can generate one on demand.
>
> It can, but why would it reimplement the "getBitSet()" method,
> (and more) rather than just delegate to its base class (i.e. no
> duplication at all in the OO paradigm)?
>
>
The implementation may not want to use the BitSet class because the filter
itself can be large. For example in DNA matching it is not uncommon for the
Bloom filter to be 100s of MB in size. This also means we don’t want to
make copies if we don’t have to. (perhaps I should add DNA matching to the
Use cases to capture the large number of bits).

> The counting Bloom
> > filter needs to be able to determine if an AbstractBloomFilter matches
> its
> > pattern. It can do this internally without using a BitSet by scanning the
> > sparse array (though a bitset might prove faster). Finally the counting
> > Bloom filter can be merged into another bloom filter just like any other
> > AbstractBloomFilter can. It does that by basically turning on the bits in
> > the AbstractBloomFilter that are indexes for the entries in the internal
> > sparse array.
>
> IIUC "match" does not need any override, while "merge" will
> handle the sparse array, and then call "super.merge".  In a
> "CountingBloomFilter", there will be two "merge", and it can be
> construed as redundant, but I suppose that "merge" is called
> much less often than "match"; hence it could be an acceptable
> compromise for not having duplicate implementations (one
> with a "BitSet" and one with an array of integers).
> A "CountingBloomFilter" would provide an additional accessor:
> "getCounts()" that returns a copy of the array of integers (i.e.
> the part of the state which it manages).  [Again, it's up to the
> application developer to devise the strategy for serialization.]
>

I agree with most of this statement. I still don’t like forcing a standard
BitSet implementation onto all instances of Bloom filter when it is the
logical construct (BitSetI) that we are looking for.


>
> > There are several other examples such as layered Bloom
> > filters (
> https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
> > and attenuated Bloom filters (
> > https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
> where
> > the internal structure deviates from the BitSet model but that still need
> > to present themselves as AbstractBloomFilters do. I think that Abstract
> > class is the correct pattern here.
>
> Not sure: "abstract" is adequate when several subclasses need to
> override behaviours; whereas, here the behaviour is the same but
> the state is expanded.
>
>
I think abstract because the methods to perform the calculations that are
currently in BloomFilterFunctions would be implementable against BitSetI.
Those are the functions that go into the abstract class (along with BitSetI
methods).


> However, there may be a potential issue if the various variants above
> need to be mixed and matched.  I.e. will there be
>  * "LayeredBloomFilter
>  * "CountingLayeredBloomFilter"
>  * "AttenuatedLayeredBloomFilter"
>  * "CountingAttenuatedLayeredBloomFilter"
> ?
>
>
I don’t know of any such usage, however, nesting Bloom filters is done in
Attenuated filters and Layered Bloom filters use a number of Bloom filter
implementations internally.


> >> > So the Bloom filter needs an external reference implementation of its
> >> > state, so that it can communicate said state with other Bloom filters.
> >> The
> >> > quesiton of what is necessary for transmission of that state is still
> >> open.
> >> > I think we can agree a mechanism to identify which bits are on is
> >> requried
> >> > as part of the state. I have seen discussions that indicate the
> hashing
> >> > algorithm and other configuration information should also be included,
> >> but
> >> > I believe this is dependent upon where the filter is being sent. All
> >> Bloom
> >> > filters within a system, by necessity, have to use the same hash and
> >> > configuration parameters.
> >>
> >> The last statement is not obvious to me, but it surely looks like
> >> a simplifying assumption.
> >> What are examples of "system" in that sentence?
> >>
> >
> > In my statement above a “system” is a series of processes that accept
> Bloom
> > filters as arguments. For example a collections of 10K buckets. If you
> > created Bloom filter on each bucket, gathered the buckets in to groups of
> > 100 and build Bloom filters for each group you could quickly tell if and
> > where an object might be by checking the group bloom filters. But every
> > Bloom filter in this system must have the same “shape” (The same hash,
> > number of bits, and number of functions) for the comparisons to work. All
> > of those are in one “system”. However, a single application may have
> > multiple systems.
> >
>
> So IIUC, which "hash" and how many "functions" are also part of the
> state.  [Number of bits is already implicitly conveyed by "getBitSet()".]
>
>
Number of bits is not implicitly conveyed by getBitSet() as the BitSet will
only grow to the size needed to capture the highest bit currently on. So
the number of bits may be 72 but the highest on bit might be 50.

Hash, number of functions (k), number of items (n), and number of bits (m)
define the system in which the filter is expressed, call this the “Shape”.
It is not always necessary to convey the Shape when conveying the bloom
filter.


> Should the "hash" be configurable?  Or can it be an "implementation
> detail" (subject to change at the discretion of the library developers)?
>
>
This is an interesting question and one I have been wrestling with. It
seems that Cassandra has incorrectly implemented the murmur128 hash and can
not fix it because it would break everything. So it seem if we are to be
used in such places, the hash implementation should be pluggable. There are
several steps in the hashing that can lead to different results. I have
toyed with the idea of crateing a list of hashes and modifiers akin to the
strings used to describe encryption hashes/algorithms. So something like
“murmur128-UC” for the case were murmur128 hashing was used, values were
calculated Unsigned (U) and the circular (C) rehash strategy was used.
Other values woud be signed (S) and incremented (I). Though that discussion
really needs to be done farther afield.


> In the former case, I guess that we should create an "enum" that
> provides all the supported algorithms.
>
>
I think we are in general agreement here. An enum is possible.


> >>
> >> > Only when sending outside the system does the
> >> > other data need to be made available. In my mind this means those data
> >> > structures have to be availabe to be transmitted along with the Bloom
> >> > filter or filters.
> >>
> >> Even if so, this is about serialization/persistence; thus, not a "core"
> >> issue.
> >> If the above "system" is the application level, all "BloomFilter"
> >> instances
> >> should be able to pass messages between them using instance methods.
> >>
> >>
> > I believe this is a “core” issue. I used an example of a proxy system in
> my
> > original post. There is no requirement that each of those proxies be on a
> > single machine (logical, virtual or physical), in fact it could be argued
> > that they probably would not be on the same machine. So passing Bloom
> > filters between machines (I am reticent to say “system” here) is one of
> the
> > core requirements to making Bloom filter usage efficient in modern
> > distributed architectures.
>
> If both sides of the communication channel know the exact type
> of "BloomFilter" instances, the application can handle serialization
> and any other required functionality that may need to persist across
> the channel (e.g. the "counts").
>

Agreed.


> >> > So now to the code in the current contribution. The argument that
> >> > BloomFilter should be a class not an interface is a good one. However,
> >> > as
> >> > the internal representation of the bit structure can vary wildly,
> >>
> >> Why should it?
> >>
> >
> > See response at concreate/abstract discussion above.
>
> I need more details about the variants (cf. above indeed).
>
>
I am not sure how to respond here. There are a number of Bloom filter
implementations. There is an effective Bloom filter API. That API has
BitSetI methods and as well as merge(), match(), hammingValue(). But like
any API how those are implemented can vary dramatically. We’ve seen
Counting, we’ve touched on Stable and Attenuated. Those may or may not have
a BitSet implementation inside. I think there is a more exotic set of Bloom
filters that come out of the attempts to index large collections, but those
are definitely edge cases. I am currently working on a paper where we have
an index that can store hundreds of millions of bloom filters and access
them in constant time. It is effectively a merge of all of those filters.
It does not use a BitSet for its representation. It also does not currently
implement the BitSetI methods. It also support deletion.


> >>
> >> > I think
> >> > it should be an abstract class. My reasoning is as follows: Assuming
> we
> >> > have an external reference implementation of state many/all of the
> >> methods
> >> > could be implemented against that definition.
> >>
> >> I don't follow: OO is about behaviour(s), "state" being an
> implementation
> >> detail.
> >>
> >> This goes to the discussion of the BloomFilterFunctions and some methods
> > in the BloomFilterConfiguration class that estimate sizes. Those
> functions
> > can be implemented against the “external reference implementation of
> state”
> > which could be a BitSet. They all operate on the bit patterns using the
> > logical “or”, “and”, “xor” functions as well as cardinality, and
> > nextSetBit.
>
> No problem, except that IMHO they are "BitSet" functions!
> [And it might be worth implementing them as first class citizen,
> using the functional interfaces in "java.util.function".]
>
>
I think we agree here too.

>> > This is what the current
> >> > implementation attempts to do, but not successfully. Any
> implementation
> >> of
> >> > the abstract class could develop an internal representation of state
> >> > that
> >> > is more efficient for whatever the use case is (e.g. counting Bloom
> >> > filters, compressed Bloom filters, attenuated Bloom filters, etc)
> >>
> >> Are these variants impossible to represent with a "BitSet"?
> >>
> >> I think in all cases where the specific implementation of BloomFilter
> > needs to present itself as a BloomFilter, yes BitSet can represent that.
> > But is cases where the internal construct is much more complex it may not
> > make sense to try to keep an internal BitSet in sync with other
> structures
> > but rather generate the BitSet if and when it is necessary.
>
> That's a crucial data point...
> So as I wrote in the previous mail, that would indeed entail that
> "BloomFilter" is an interface.  And a common base class (using
> a "BitSet" for its state) makes no sense.
>

I think that we define the Bloom filter as implementing BitSetI and add the
methods for the abstract class I mentioned above. Now the concrete
implementation of BloomFilter can use BitSet as its internal state
representation and implement BitSetI to delegate to the BitSet instance.
Counting Blom filter can create an array of integers and implement BitSetI
based on the values in the array without the necessity of keeping a BitSet
in sync or taking the extra memory footprint. Attenuated, Layered, and
Stable Bloom filters can implement their internal state as makes sense for
the filter and still provide the BitSetI implementation with or without a
BitSet instance.

>
> >> > and
> >> > override the methods that are more efficiently answered within its
> >> internal
> >> > structure.
> >>
> >> Unless I'm missing the point, this then seems to mandate an
> >> interface (rather than an abstract or concrete class), contrary
> >> to what was said previously...
> >>
> >>
> > I think was was missing was the discussion of the Functions as noted in
> my
> > response to your comments about OO behaviour. I am sorry that I did not
> > make that clear in the original post, it probably would have reduced the
> > confusion significantly. Just to be clear. I think this is an Abstract
> > class that implements an “external reference implementation of state”
> > interface. That interface will then allow the abstract Bloom filter class
> > to implement the functions that are in the BloomFilterFunctions class
> (thus
> > removing the utility class). In addition they will permit the
> > BloomFilterConfigurtion class to continue to implement the estimated size
> > methods agains the AbstractBloomFilter class. I think the implementation
> of
> > the “external reference implementation of state” ends up being
> implemented
> > by the concrete Bloom filter classes. I don’t like the term “external
> > reference implementation of state” and wish I had a good interface name
> to
> > apply to it, but I started with it earlier and am sticking with it here
> > until it is either discarded or named.
>
> I'm lost here; I don't know what is "the" reference state, since a
> single structure cannot for example represent both a standard
> filter and a counting filter...
>
>
Think BitSetI.

>
> >
> >> > In addition, the abstract Bloom filter might implement the “data
> >> > read” methods from BitSet directly. This would allow more efficient
> >> > implementation of several methods that currently depend upon the
> >> > creation
> >> > of a BitSet and the manipulation of same.
> >>
> >> Sure; if any and all variants of a "BloomFilter" implementation can
> >> be created from a "BitSet", then such a factory method could be
> >> part of the "Builder" interface:
> >>
> >> public interface BloomFilter {
> >>     // Core methods...
> >>
> >>     public interface Builder {
> >>         public Builder from(BitSet s); // Factory method.
> >>         public BloomFilter build();
> >>         // Other builder methods.
> >>     }
> >> }
> >>
> >> Indeed, this is a valid construct. Currently the constructors take a
> > ProtoBloomFilter and BloomFilterConfiguration, or a BitSet. If the
> > “external reference implementation of state” is used then I expect to
> > change the second constructor to take an AbstractBloomFilter instead.
>
> Since I'm stuck with what the "reference state" is, I cannot follow...
>
>
Think BitSetI.


> > It is
> > the ProtoBloomFilter that has the Builder. The ProtoBloomFilter allows us
> > to build the hashes but not apply them to build the specific “shape”
> > (definition of “shape” noted above). Thus one ProtoBloomFilter can be the
> > progenitor of Bloom filters in multiple “systems” (defintion of “system”
> > noted above). This also means that the 10K bucket example I noted above
> can
> > become much more performant because we can segregate the bucket and group
> > Bloom filters into different “Systems” so that we get better collision
> > avoidance at the group level than we would otherwise. (I can provide data
> > and example if you wish.)
> >
>
> Still lost; examples may help.
> My impression is that we could come to that after deciding
> whether single inheritance will fit the bill or not.
>
>
Ok. Let’s put this aside until we get the other worked out, or until it
becomes critical for getting the othe worked out.


> >> >
> >> >
> >> > I think this change yields the following:
> >> >
> >> >
> >> > 1. Creation of an external reference implementation of state.
> >> >
> >> > 2. Conversion of BloomFilter for interface to AbstractBloomFilter
> class
> >> >
> >> > 3. Merge BloomFilterFunctions into AbstractBloomFilter
> >> >
> >> > 4. Rename StandardBloomFilter to BloomFilter (should we do this?)
> >> >
> >> > 5. Modify (Standard)BloomFilter and CountingBloomFilter to extends the
> >> > AbstractBloomFilter class.
> >>
> >> It seems that somehow we diverge on the emphasis given to the
> >> "state".  I think that the questions which I raised above should be
> >> answered before jumping to those conclusions.
> >>
> >
> > I agree and hope this response has clarified and answered.
>
> Not totally. :-}
>

Let's keep working then.


> >>
> >> > On the topic of an external reference implementation of state, could
> >> > this
> >> > be an interface that AbstractBloomFilter implements? Might it comprise
> >> the
> >> > “read data” methods from the BitSet class?
> >>
> >> IIUC, that will trivially be the case if "BitSet" can be the exchange
> >> format. And how to actually perform serialization is a user's issue
> >> (i.e. whether to rely on "BitSet" being "Serializable", or to implement
> >> a proxy...).
> >>
> >> To be honest, I had not worked through the “external reference
> > implementation of state” but initially thought it might be a BitSet.
> After
> > the above, perhaps not. Again, I haven’t quite thought it all through. If
> > “external reference implementation of state” really is an interface then
> it
> > must be possible to construct a Bloom filter from an instance of it.
> > Otherwise, developers would not be able to serialize the data,
> deserialize
> > into an instance of the interface and recreate the filter. On the other
> > hand, the more complext Bloom filters might require other data (e.g. a
> > sparse list of counts) to fully reconstruct them. These later cases are
> > probably best considered special cases for each filter type and addressed
> > within the specific implementation.
>
> So we came to an similar conclusion that there is definitely an
> issue if the "reference" is supposed to be able to represent all
> the filter variants.  If this is true, then we'd have a single
> "MegaBloomFilter" that combines "Standard", "Counting",
> "Layered", etc. and whatever additional functionality might be
> required later on...  But I may be missing many things now...
>
> I still think that talking about serialization does not help at all.
> An example (a unit test maybe) of what should be possible
> in the distributed architecture which you evoked would help
> fix ideas.
>
>
Perhaps we should be clear that we are talking about little “s”
serialization not the baggage that comes with “Serializable”. Yes, unit
tests would probably help.


> >> >
> >> > [1]
> https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
> >> >
> >>
> >
> > As an aside I was reading about "The Humpty Dumpty Principle" and came
> > across
> >
> https://definitionsinsemantics.blogspot.com/2012/03/humpty-dumpty-principle-in-definitions.html
> .
> > The article makes me wish I have never used the term “external reference
> > implementation of state”
> >
> > Thanks for your patience and for taking the time to read through all of
> > this,
> > Claude
>
> To be pragmatic, I wonder whether I should create a branch on
> within the Apache repository, so that all (including me) can
> contribute and actually test the various alternatives...
>
>
Is there a space in which to put non-code documents?  Things like the Use
cases and perhaps a design doc that describes the decisions we have made
and why.

I am going to create a branch in my repository and test the design to see
if what I envision makes sense.

One final issue. There was recently a post on the pull request asking why
the Bloom filter is an immutable object. We should probably explore why or
why not.

The biggest argument against is that the merge() method would requires that
the data be duplicated and it makes it hard to do in place updates. For
example when updating a “bucket” filter (as defined in the Use cases).

But if it is not immutable then the hashCode value changes making it
unsuitable for Hash sets. I think that there are classes in Commons that
have similar restrictions, and I think that documenting the issue in the
class javadoc as well as the hashCode() and equals() (assuming we actually
override them) would be acceptable.

Thank you,
Claude

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message