commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: [collections] BloomFilter package architecture discussion
Date Thu, 17 Oct 2019 15:47:13 GMT
I have added a list of Use Cases (
https://github.com/Claudenw/commons-collections/blob/master/src/main/java/org/apache/commons/collections4/bloomfilter/Usage.md)
I expect that these will end up in documentation somewhere.

On Wed, Oct 16, 2019 at 2:08 AM Gilles Sadowski <gilleseran@gmail.com>
wrote:

> Hi.
>
> 2019-10-15 20:05 UTC+02:00, Claude Warren <claude@xenei.com>:
> > On Tue, Oct 15, 2019 at 1:46 AM Gilles Sadowski <gilleseran@gmail.com>
> > wrote:
> >
> >> Hello.
> >>
> >>
> >>
> >> > Furthermore,
> >> > the other potential users and supporters have not responded to any
> >> > communication about this proposal so I am floundering on that front
> >> > too.
> >>
> >> Who are they?
> >>
> >
> > Developers I have worked with or know of that have expertice utilizing
> and
> > analyzing Bloom filters in various projects and papers. Apache developers
> > that are considering adding Bloom filters to their projects.
>
> For sure, it would clear the path (to a design) if more people
> would lay out (usage) requirements.
>
> >
> >>
> >> > What I want to do is take a step back. Let’s remove the collection
> >> packages
> >> > from consideration and work out the base package first.
> >>
> >> +1
> >>
> >> > With that in mind I
> >> > would like to present reasoning behind the classes as they now stand
> >> > and
> >> > suggest a way forward to get the core classes in before adding the
> >> > collections.
> >>
> >> For sure, I appreciate your willingness to provide more explanations.
> :-)
> >>
> >> > As discussed earlier Bloom filters are a concept. In most peoples
> >> > mind’s
> >> > they are bit vectors with specific named functions (e.g. match => A
> and
> >> B =
> >> > A, hammingValue => cardinality(this)). This idea of the filter as a
> bit
> >> > vector led me to using a BitSet. Originally the BloomFilter was
> >> > serializable but valid arguments against that and toward making the
> >> > internal representation visible enought to serialize that lead to
> >> > making
> >> > the BitSet public.
> >>
> >> Maybe I'm missing your point, but I don't see a direct connection
> >> between persistence of a "BloomFillter" instance and the visibility
> >> of "BitSet" (if that is used to represent its internal state).
> >>
> >> Realistically, when we “persist” a Bloom filter we are preserving the
> > internal state. If the Bloom filter itself is not serializable then the
> > internal state has to be preserved somehow. You made a comment on Sep 23,
> > 2019 when we were discussing the serialization on the “New Sub-project
> > Proposal.” topic in which you said “At first sight, I'd say that
> > serialization is out-of-scope (we should let application developers deal
> > with that using the available accessors).” Perhaps I have misunderstood
> but
> > when I read that a light went on. What has to happen is that enough of
> the
> > internal state has to be exposed for the developer to preserve that state
> > and be able to reinstantiate it later. As the state is really an bit
> array
> > I figured the most logical thing would be to expose the BitSet.
> > Alternatively we could expose a ByteBuffer or any of a number of
> structures
> > that can represent a bit array. However, the BitSet is, semantically
> > speaking, what we are exposing.
>
> Maybe I was not clear enough: I'm not saying that we should prefer
> some representation (of the state) over another; only that the how
> the state is represented externally need not be the same as the internal
> representation.
>
> But if the state is fully encapsulated by a "BitSet", then internal and
> external can be the same, making easier both for developers and for
> users.  So let's have indeed a "getBitSet()" that returns a copy of the
> (private final) instance field (of type "BitSet") used to represent the
> state of a "BloomFilter".
>
> >> > In addition to the normal Bloom filter methods
> >> > (hammingValue, match, etc.) there are some methods (log, cosine
> >> > distance,
> >> > jaccard distance, etc.) that are used by researchers and other
> >> > attempting
> >> > to do some work with indexing and such for large collections of
> >> > filters.
> >>
> >> IIUC, you indeed distinguish between core functionality (i.e. "normal
> >> Bloom filter methods") and additional support for (?)...
> >> If so, "core" should indeed be implemented first, with maximum
> >> encapsulation (even if it would seem to prevent "research" usage).
> >>
> >> Indeed I distinguish. I distinguish to align various functionality with
> > external documentation. For example in Bloom’s paper he does not talk
> about
> > Hamming values a term and concept that is found in every Bloom filter
> > implementation I have seen and which is frequent in the literature. So a
> > “Strict” implementation of a Bloom filter might only have a single method
> > “match()” and still be considered an implementation of a Bloom filter,
> but
> > it would not be of much use in modern systems. Realistically, there are a
> > number of methods that are commonly found in Bloom filter
> implementations.
> > I believe that at a minimum they are “match()”, “merge()” and
> > “hammingValue()”.
>
> No problem with that: API should thus contain those instance methods.
> That's the kind of information we need from users of the functionality.
>
> >> > Finally there is the question of “normal” usage. In his paper on
> >> > “Compressed Bloom Filters”[1] (an implementation we do not have yet)
> >> > Michael Mitzenmacher states “the Bloom filter plays a dual role. It is
> >> both
> >> > a data structure being used at the proxies, and a message being passed
> >> > between them”. He is discussing the issue of using bloom filters in a
> >> > distributed cache where the proxies distributed their bloom filter so
> >> that
> >> > all the proxies know which proxy might have the document. So the Bloom
> >> > filter needs both an internal representation and a mechanism to pass
> >> > the
> >> > data between systems that use it. It seems obvious that the BitSet
> >> > works
> >> as
> >> > the mechanism to pass the data between systems, though an immutable
> >> version
> >> > would be preferred.
> >>
> >> So it seems we can have a Bloom filter state represented by a
> >> "private" and "final" instance of "BitSet".  Is there any reason that
> >> the implementation in [Collections] would ever need to use several
> >> representations of its state?
> >> If not, I'd think that "BloomFilter" can be a concrete class (with, for
> >> now, a "BitSet" as its state).
> >>
> >> The counter examples that come to mind is not within a single Bloom
> >> filter
> > type. A counting bloom filter tracks the number of times a bit has been
> set
> > so that it can do deletions (something standard Bloom filters have a hard
> > time with). The implementation in the code base simply extends the
> > StandardBloomFilter implementation and adds a sparse array of Integers as
> > the counts.
>
> So, in my understanding, it is an extension of the above "BitSet"
> state, and thus a valid case for inheriting from "BloomFilter" (as
> a concrete, not "abstract", class.
>
> > However, the counting Bloom filter does not need the BitSet
> > from StandardBloomFilter. It can generate one on demand.
>
> It can, but why would it reimplement the "getBitSet()" method,
> (and more) rather than just delegate to its base class (i.e. no
> duplication at all in the OO paradigm)?
>
> > The counting Bloom
> > filter needs to be able to determine if an AbstractBloomFilter matches
> its
> > pattern. It can do this internally without using a BitSet by scanning the
> > sparse array (though a bitset might prove faster). Finally the counting
> > Bloom filter can be merged into another bloom filter just like any other
> > AbstractBloomFilter can. It does that by basically turning on the bits in
> > the AbstractBloomFilter that are indexes for the entries in the internal
> > sparse array.
>
> IIUC "match" does not need any override, while "merge" will
> handle the sparse array, and then call "super.merge".  In a
> "CountingBloomFilter", there will be two "merge", and it can be
> construed as redundant, but I suppose that "merge" is called
> much less often than "match"; hence it could be an acceptable
> compromise for not having duplicate implementations (one
> with a "BitSet" and one with an array of integers).
> A "CountingBloomFilter" would provide an additional accessor:
> "getCounts()" that returns a copy of the array of integers (i.e.
> the part of the state which it manages).  [Again, it's up to the
> application developer to devise the strategy for serialization.]
>
> > There are several other examples such as layered Bloom
> > filters (
> https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
> > and attenuated Bloom filters (
> > https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
> where
> > the internal structure deviates from the BitSet model but that still need
> > to present themselves as AbstractBloomFilters do. I think that Abstract
> > class is the correct pattern here.
>
> Not sure: "abstract" is adequate when several subclasses need to
> override behaviours; whereas, here the behaviour is the same but
> the state is expanded.
>
> However, there may be a potential issue if the various variants above
> need to be mixed and matched.  I.e. will there be
>  * "LayeredBloomFilter
>  * "CountingLayeredBloomFilter"
>  * "AttenuatedLayeredBloomFilter"
>  * "CountingAttenuatedLayeredBloomFilter"
> ?
>
> >> > So the Bloom filter needs an external reference implementation of its
> >> > state, so that it can communicate said state with other Bloom filters.
> >> The
> >> > quesiton of what is necessary for transmission of that state is still
> >> open.
> >> > I think we can agree a mechanism to identify which bits are on is
> >> requried
> >> > as part of the state. I have seen discussions that indicate the
> hashing
> >> > algorithm and other configuration information should also be included,
> >> but
> >> > I believe this is dependent upon where the filter is being sent. All
> >> Bloom
> >> > filters within a system, by necessity, have to use the same hash and
> >> > configuration parameters.
> >>
> >> The last statement is not obvious to me, but it surely looks like
> >> a simplifying assumption.
> >> What are examples of "system" in that sentence?
> >>
> >
> > In my statement above a “system” is a series of processes that accept
> Bloom
> > filters as arguments. For example a collections of 10K buckets. If you
> > created Bloom filter on each bucket, gathered the buckets in to groups of
> > 100 and build Bloom filters for each group you could quickly tell if and
> > where an object might be by checking the group bloom filters. But every
> > Bloom filter in this system must have the same “shape” (The same hash,
> > number of bits, and number of functions) for the comparisons to work. All
> > of those are in one “system”. However, a single application may have
> > multiple systems.
> >
>
> So IIUC, which "hash" and how many "functions" are also part of the
> state.  [Number of bits is already implicitly conveyed by "getBitSet()".]
>
> Should the "hash" be configurable?  Or can it be an "implementation
> detail" (subject to change at the discretion of the library developers)?
>
> In the former case, I guess that we should create an "enum" that
> provides all the supported algorithms.
>
> >>
> >> > Only when sending outside the system does the
> >> > other data need to be made available. In my mind this means those data
> >> > structures have to be availabe to be transmitted along with the Bloom
> >> > filter or filters.
> >>
> >> Even if so, this is about serialization/persistence; thus, not a "core"
> >> issue.
> >> If the above "system" is the application level, all "BloomFilter"
> >> instances
> >> should be able to pass messages between them using instance methods.
> >>
> >>
> > I believe this is a “core” issue. I used an example of a proxy system in
> my
> > original post. There is no requirement that each of those proxies be on a
> > single machine (logical, virtual or physical), in fact it could be argued
> > that they probably would not be on the same machine. So passing Bloom
> > filters between machines (I am reticent to say “system” here) is one of
> the
> > core requirements to making Bloom filter usage efficient in modern
> > distributed architectures.
>
> If both sides of the communication channel know the exact type
> of "BloomFilter" instances, the application can handle serialization
> and any other required functionality that may need to persist across
> the channel (e.g. the "counts").
>
> >> > So now to the code in the current contribution. The argument that
> >> > BloomFilter should be a class not an interface is a good one. However,
> >> > as
> >> > the internal representation of the bit structure can vary wildly,
> >>
> >> Why should it?
> >>
> >
> > See response at concreate/abstract discussion above.
>
> I need more details about the variants (cf. above indeed).
>
> >>
> >> > I think
> >> > it should be an abstract class. My reasoning is as follows: Assuming
> we
> >> > have an external reference implementation of state many/all of the
> >> methods
> >> > could be implemented against that definition.
> >>
> >> I don't follow: OO is about behaviour(s), "state" being an
> implementation
> >> detail.
> >>
> >> This goes to the discussion of the BloomFilterFunctions and some methods
> > in the BloomFilterConfiguration class that estimate sizes. Those
> functions
> > can be implemented against the “external reference implementation of
> state”
> > which could be a BitSet. They all operate on the bit patterns using the
> > logical “or”, “and”, “xor” functions as well as cardinality, and
> > nextSetBit.
>
> No problem, except that IMHO they are "BitSet" functions!
> [And it might be worth implementing them as first class citizen,
> using the functional interfaces in "java.util.function".]
>
> >> > This is what the current
> >> > implementation attempts to do, but not successfully. Any
> implementation
> >> of
> >> > the abstract class could develop an internal representation of state
> >> > that
> >> > is more efficient for whatever the use case is (e.g. counting Bloom
> >> > filters, compressed Bloom filters, attenuated Bloom filters, etc)
> >>
> >> Are these variants impossible to represent with a "BitSet"?
> >>
> >> I think in all cases where the specific implementation of BloomFilter
> > needs to present itself as a BloomFilter, yes BitSet can represent that.
> > But is cases where the internal construct is much more complex it may not
> > make sense to try to keep an internal BitSet in sync with other
> structures
> > but rather generate the BitSet if and when it is necessary.
>
> That's a crucial data point...
> So as I wrote in the previous mail, that would indeed entail that
> "BloomFilter" is an interface.  And a common base class (using
> a "BitSet" for its state) makes no sense.
>
> >> > and
> >> > override the methods that are more efficiently answered within its
> >> internal
> >> > structure.
> >>
> >> Unless I'm missing the point, this then seems to mandate an
> >> interface (rather than an abstract or concrete class), contrary
> >> to what was said previously...
> >>
> >>
> > I think was was missing was the discussion of the Functions as noted in
> my
> > response to your comments about OO behaviour. I am sorry that I did not
> > make that clear in the original post, it probably would have reduced the
> > confusion significantly. Just to be clear. I think this is an Abstract
> > class that implements an “external reference implementation of state”
> > interface. That interface will then allow the abstract Bloom filter class
> > to implement the functions that are in the BloomFilterFunctions class
> (thus
> > removing the utility class). In addition they will permit the
> > BloomFilterConfigurtion class to continue to implement the estimated size
> > methods agains the AbstractBloomFilter class. I think the implementation
> of
> > the “external reference implementation of state” ends up being
> implemented
> > by the concrete Bloom filter classes. I don’t like the term “external
> > reference implementation of state” and wish I had a good interface name
> to
> > apply to it, but I started with it earlier and am sticking with it here
> > until it is either discarded or named.
>
> I'm lost here; I don't know what is "the" reference state, since a
> single structure cannot for example represent both a standard
> filter and a counting filter...
>
> >
> >
> >> > In addition, the abstract Bloom filter might implement the “data
> >> > read” methods from BitSet directly. This would allow more efficient
> >> > implementation of several methods that currently depend upon the
> >> > creation
> >> > of a BitSet and the manipulation of same.
> >>
> >> Sure; if any and all variants of a "BloomFilter" implementation can
> >> be created from a "BitSet", then such a factory method could be
> >> part of the "Builder" interface:
> >>
> >> public interface BloomFilter {
> >>     // Core methods...
> >>
> >>     public interface Builder {
> >>         public Builder from(BitSet s); // Factory method.
> >>         public BloomFilter build();
> >>         // Other builder methods.
> >>     }
> >> }
> >>
> >> Indeed, this is a valid construct. Currently the constructors take a
> > ProtoBloomFilter and BloomFilterConfiguration, or a BitSet. If the
> > “external reference implementation of state” is used then I expect to
> > change the second constructor to take an AbstractBloomFilter instead.
>
> Since I'm stuck with what the "reference state" is, I cannot follow...
>
> > It is
> > the ProtoBloomFilter that has the Builder. The ProtoBloomFilter allows us
> > to build the hashes but not apply them to build the specific “shape”
> > (definition of “shape” noted above). Thus one ProtoBloomFilter can be the
> > progenitor of Bloom filters in multiple “systems” (defintion of “system”
> > noted above). This also means that the 10K bucket example I noted above
> can
> > become much more performant because we can segregate the bucket and group
> > Bloom filters into different “Systems” so that we get better collision
> > avoidance at the group level than we would otherwise. (I can provide data
> > and example if you wish.)
> >
>
> Still lost; examples may help.
> My impression is that we could come to that after deciding
> whether single inheritance will fit the bill or not.
>
> >> >
> >> >
> >> > I think this change yields the following:
> >> >
> >> >
> >> > 1. Creation of an external reference implementation of state.
> >> >
> >> > 2. Conversion of BloomFilter for interface to AbstractBloomFilter
> class
> >> >
> >> > 3. Merge BloomFilterFunctions into AbstractBloomFilter
> >> >
> >> > 4. Rename StandardBloomFilter to BloomFilter (should we do this?)
> >> >
> >> > 5. Modify (Standard)BloomFilter and CountingBloomFilter to extends the
> >> > AbstractBloomFilter class.
> >>
> >> It seems that somehow we diverge on the emphasis given to the
> >> "state".  I think that the questions which I raised above should be
> >> answered before jumping to those conclusions.
> >>
> >
> > I agree and hope this response has clarified and answered.
>
> Not totally. :-}
>
> >>
> >> > On the topic of an external reference implementation of state, could
> >> > this
> >> > be an interface that AbstractBloomFilter implements? Might it comprise
> >> the
> >> > “read data” methods from the BitSet class?
> >>
> >> IIUC, that will trivially be the case if "BitSet" can be the exchange
> >> format. And how to actually perform serialization is a user's issue
> >> (i.e. whether to rely on "BitSet" being "Serializable", or to implement
> >> a proxy...).
> >>
> >> To be honest, I had not worked through the “external reference
> > implementation of state” but initially thought it might be a BitSet.
> After
> > the above, perhaps not. Again, I haven’t quite thought it all through. If
> > “external reference implementation of state” really is an interface then
> it
> > must be possible to construct a Bloom filter from an instance of it.
> > Otherwise, developers would not be able to serialize the data,
> deserialize
> > into an instance of the interface and recreate the filter. On the other
> > hand, the more complext Bloom filters might require other data (e.g. a
> > sparse list of counts) to fully reconstruct them. These later cases are
> > probably best considered special cases for each filter type and addressed
> > within the specific implementation.
>
> So we came to an similar conclusion that there is definitely an
> issue if the "reference" is supposed to be able to represent all
> the filter variants.  If this is true, then we'd have a single
> "MegaBloomFilter" that combines "Standard", "Counting",
> "Layered", etc. and whatever additional functionality might be
> required later on...  But I may be missing many things now...
>
> I still think that talking about serialization does not help at all.
> An example (a unit test maybe) of what should be possible
> in the distributed architecture which you evoked would help
> fix ideas.
>
> >> >
> >> > [1]
> https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
> >> >
> >>
> >
> > As an aside I was reading about "The Humpty Dumpty Principle" and came
> > across
> >
> https://definitionsinsemantics.blogspot.com/2012/03/humpty-dumpty-principle-in-definitions.html
> .
> > The article makes me wish I have never used the term “external reference
> > implementation of state”
> >
> > Thanks for your patience and for taking the time to read through all of
> > this,
> > Claude
>
> To be pragmatic, I wonder whether I should create a branch on
> within the Apache repository, so that all (including me) can
> contribute and actually test the various alternatives...
>
> Regards,
> Gilles
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message