commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles Sadowski <gillese...@gmail.com>
Subject Re: [collections] BloomFilter package architecture discussion
Date Fri, 18 Oct 2019 14:22:35 GMT
Hi.

>>> [...]
> >
> > Maybe I was not clear enough: I'm not saying that we should prefer
> > some representation (of the state) over another; only that the how
> > the state is represented externally need not be the same as the internal
> > representation.
> >
> > But if the state is fully encapsulated by a "BitSet", then internal and
> > external can be the same, making easier both for developers and for
> > users.  So let's have indeed a "getBitSet()" that returns a copy of the
> > (private final) instance field (of type "BitSet") used to represent the
> > state of a "BloomFilter".
> >
> >
> I agree with most of this statement. However, I think the use of BitSet for
> the internal representation is questionable.

Hence, that settles the issue of "class" versus "interface": "BloomFilter"
must be an interface.

> For the simple cases it works
> and there is a whole class of Bloom filters where this would be acceptable.
> However for filters that have some concept of a count (e.g. Counting Bloom
> Filter, Stable Bloom Filters) it doesn’t make sense to force the use of the
> BitSet internally.

If the "BitSet" instance field is private, there is no "forcing".
The issue is whether the API *must* provide a "toBitSet()" method.
The implementation will be trivial if the "BitSet" is the internal state;
but maybe not so otherwise, and perhaps an unnecessary (?) burden
on the implementations.

> In addition, there are researchers that do not use
> BitSet because it is too slow or too bloated.

Important point.
[I thought that "BitSet" was evoked because it perfectly fitted the bill.]

> The exaples are
> https://github.com/lemire/javaewah, and
> https://github.com/RoaringBitmap/RoaringBitmap which are used in Apache
> projects. The Dilemma her is to require BitSet or simply require BitSet
> functionality. I think we should go with the later.

Re to "bloated", we should go for *less* than "BitSet" (IIUC).

> I think we should
> define an interface that represents the BitSet functionality and then
> provide an imlementation that simply wraps the BitSet.

We are getting to operational small steps. :-)
I suggest defining the *basic* "BloomFilter" interface (*even* if you
already know, or suspect, there will be more methods) that defines the
required functionalities (i.e. "match", "merge", ...).
Then, *in the unit tests*, declare a class (say, "BitSetBloomFilter") that
implements "BloomFilter" and uses a "BitSet" as its state.
First ensure that this implementation actually works (i.e. provides the
necessary and sufficient "BitSet" functionality which you referred to
above).  Then, it will serve as a reference, in unit tests that will check
the implementations which the library will provide in order to satisfy the
actual use-cases.

> We can use the
> Implementation in our classes and use the interface where possible but it
> leaves open the possibility of switching in a version of RoaringBitmap or
> similar without changing the external API. It is this interface that is the
> “external reference implementation of state”. I think I would like to call
> this interface BitSetI and will refer to it this way later. I think it
> should contain the following methods as defined in BitSet: and(), andNot(),
> cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
> nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(), xor().

IMHO, this proposal orthogonal to the "BloomFilter" functionality, and
we must discuss separately whether [Collections] wants to host such an
API.

A priori, how the classes that implements the "BloomFilter" interface
represents the "set of bits" is an implementation detail.  Even that there
is a "set of bits" is a detail to *users* of the interface.
Of course, a practical question is whether the representation should be
"pluggable".  If not, we could change to using one or another from those
classes that you mentioned (and change it from version to version).
In any case, I think that this would be a feature to be left for later (and
that could be added in a compatible way, if nothing of the "state" is leaked
into the public API).

> > > [...]
> >
> > > However, the counting Bloom filter does not need the BitSet
> > > from StandardBloomFilter. It can generate one on demand.
> >
> > It can, but why would it reimplement the "getBitSet()" method,
> > (and more) rather than just delegate to its base class (i.e. no
> > duplication at all in the OO paradigm)?
> >
> >
> The implementation may not want to use the BitSet class because the filter
> itself can be large. For example in DNA matching it is not uncommon for the
> Bloom filter to be 100s of MB in size. This also means we don’t want to
> make copies if we don’t have to. (perhaps I should add DNA matching to the
> Use cases to capture the large number of bits).

So this use case would entail that the "BloomFilter" interface must *not*
mandate a "toBitSet()" method.

> > [...]
> >
>
> I agree with most of this statement. I still don’t like forcing a standard
> BitSet implementation onto all instances of Bloom filter

+1

> when it is the
> logical construct (BitSetI) that we are looking for.

-1
[IIUC; see my comments above.]

> >
> > > There are several other examples such as layered Bloom
> > > filters (
> > https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
> > > and attenuated Bloom filters (
> > > https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
> > where
> > > the internal structure deviates from the BitSet model but that still need
> > > to present themselves as AbstractBloomFilters do. I think that Abstract
> > > class is the correct pattern here.
> >
> > Not sure: "abstract" is adequate when several subclasses need to
> > override behaviours; whereas, here the behaviour is the same but
> > the state is expanded.
> >
> >
> I think abstract because the methods to perform the calculations that are
> currently in BloomFilterFunctions would be implementable against BitSetI.
> Those are the functions that go into the abstract class (along with BitSetI
> methods).

Ditto: these are details from the POV of "BloomFilter" users; of course,
we need code that will do the work on the whatever represents the "sets
of bits", but at first sight I don't think that we need to advertise a "BitSetI"
(even a "protected" one) in an abstract class, because it's unlikely that
any and all existing external implementations of "set of bits" will comply
with an interface defined here.

>
> > However, there may be a potential issue if the various variants above
> > need to be mixed and matched.  I.e. will there be
> >  * "LayeredBloomFilter
> >  * "CountingLayeredBloomFilter"
> >  * "AttenuatedLayeredBloomFilter"
> >  * "CountingAttenuatedLayeredBloomFilter"
> > ?
> >
> >
> I don’t know of any such usage, however, nesting Bloom filters is done in
> Attenuated filters and Layered Bloom filters use a number of Bloom filter
> implementations internally.

What do those use cases require from the "BloomFilter" interface?

> > > [...]
> >
> Number of bits is not implicitly conveyed by getBitSet() as the BitSet will
> only grow to the size needed to capture the highest bit currently on. So
> the number of bits may be 72 but the highest on bit might be 50.

OK; doesn't matter now since a common "BitSet" state has been
ruled out.
I get that "numberOfBits()" is one of the methods defined by
the "BloomFilter" interface.

>
> Hash, number of functions (k), number of items (n), and number of bits (m)
> define the system in which the filter is expressed, call this the “Shape”.
> It is not always necessary to convey the Shape when conveying the bloom
> filter.

Well, if it is *ever* necessary, then "BloomFilter" must declare a "shape()"
method.  Hence, it could be worth defining a "Shape" subinterface (?).

> > Should the "hash" be configurable?  Or can it be an "implementation
> > detail" (subject to change at the discretion of the library developers)?
> >
> >
> This is an interesting question and one I have been wrestling with. It
> seems that Cassandra has incorrectly implemented the murmur128 hash and can
> not fix it because it would break everything. So it seem if we are to be
> used in such places, the hash implementation should be pluggable.

Indeed.

> There are
> several steps in the hashing that can lead to different results. I have
> toyed with the idea of crateing a list of hashes and modifiers akin to the
> strings used to describe encryption hashes/algorithms. So something like
> “murmur128-UC” for the case were murmur128 hashing was used, values were
> calculated Unsigned (U) and the circular (C) rehash strategy was used.
> Other values woud be signed (S) and incremented (I). Though that discussion
> really needs to be done farther afield.
>
> > In the former case, I guess that we should create an "enum" that
> > provides all the supported algorithms.
> >
> >
> I think we are in general agreement here. An enum is possible.

"Commons RNG" could be an inspiration:
    http://commons.apache.org/proper/commons-rng/commons-rng-simple/apidocs/org/apache/commons/rng/simple/RandomSource.html

> > >> [...]
> >
> > If both sides of the communication channel know the exact type
> > of "BloomFilter" instances, the application can handle serialization
> > and any other required functionality that may need to persist across
> > the channel (e.g. the "counts").
> >
>
> Agreed.

Great. ;-)
[It was also a premise for "Commons RNG" that serialization is a user concern.]

> > >> > [...]
> >
> > I need more details about the variants (cf. above indeed).
> >
> >
> I am not sure how to respond here. There are a number of Bloom filter
> implementations. There is an effective Bloom filter API. That API has
> BitSetI methods and as well as merge(), match(), hammingValue(). But like
> any API how those are implemented can vary dramatically. We’ve seen
> Counting, we’ve touched on Stable and Attenuated. Those may or may not have
> a BitSet implementation inside. I think there is a more exotic set of Bloom
> filters that come out of the attempts to index large collections, but those
> are definitely edge cases. I am currently working on a paper where we have
> an index that can store hundreds of millions of bloom filters and access
> them in constant time. It is effectively a merge of all of those filters.
> It does not use a BitSet for its representation. It also does not currently
> implement the BitSetI methods. It also support deletion.

What I was saying above: a public "BitSetI" is not necessary for a
"BloomFilter" to work.
It would be nice if such a powerful functionality could be easily achieved
through "Commons" classes. ;-)

> > >>
> > >> [...] They all operate on the bit patterns using the
> > > logical “or”, “and”, “xor” functions as well as cardinality, and
> > > nextSetBit.
> >
> > No problem, except that IMHO they are "BitSet" functions!
> > [And it might be worth implementing them as first class citizen,
> > using the functional interfaces in "java.util.function".]
> >
> >
> I think we agree here too.

So, this is a potentially interesting but different contribution: Functional
interfaces for bit sets operations.

> >> >
> > > but rather generate the BitSet if and when it is necessary.

A previous argument (use case for very large filters) made this unwanted.

> >
> > That's a crucial data point...
> > So as I wrote in the previous mail, that would indeed entail that
> > "BloomFilter" is an interface.  And a common base class (using
> > a "BitSet" for its state) makes no sense.
> >
>
> I think that we define the Bloom filter as implementing BitSetI and add the
> methods for the abstract class I mentioned above. Now the concrete
> implementation of BloomFilter can use BitSet as its internal state
> representation and implement BitSetI to delegate to the BitSet instance.
> Counting Blom filter can create an array of integers and implement BitSetI
> based on the values in the array without the necessity of keeping a BitSet
> in sync or taking the extra memory footprint. Attenuated, Layered, and
> Stable Bloom filters can implement their internal state as makes sense for
> the filter and still provide the BitSetI implementation with or without a
> BitSet instance.

On GitHub, you gave Another argument against a full-fledged "BitSetI"
being part of the "BloomFilter" API: a filter instance should be immutable.

> > >> > [...]
> >
> > I'm lost here; I don't know what is "the" reference state, since a
> > single structure cannot for example represent both a standard
> > filter and a counting filter...
> >
> >
> Think BitSetI.

Still: No.
IMHO. ;-)

IIUC, for a Bloom filter,
    "state" = "shape" + "set of bits"

> > >
> > >> > In addition, the abstract Bloom filter might implement the “data
> > >> > read” methods from BitSet directly. This would allow more efficient
> > >> > implementation of several methods that currently depend upon the
> > >> > creation
> > >> > of a BitSet and the manipulation of same.
> > >>
> > >> Sure; if any and all variants of a "BloomFilter" implementation can
> > >> be created from a "BitSet", then such a factory method could be
> > >> part of the "Builder" interface:
> > >>
> > >> public interface BloomFilter {
> > >>     // Core methods...
> > >>
> > >>     public interface Builder {
> > >>         public Builder from(BitSet s); // Factory method.

As above, for large filters, this probably is not a good requirement.
Should it be
   public Builder from(BooleanStream s);
?
["BooleanStream" is not provided by the JDK...]

> > >>         public BloomFilter build();
> > >>         // Other builder methods.
> > >>     }
> > >> }
> > >>
> > >> Indeed, this is a valid construct. Currently the constructors take a
> > > ProtoBloomFilter and BloomFilterConfiguration, or a BitSet. If the
> > > “external reference implementation of state” is used then I expect to
> > > change the second constructor to take an AbstractBloomFilter instead.
> >
> > Since I'm stuck with what the "reference state" is, I cannot follow...
> >
> >
> Think BitSetI.

... ;-)

> > > It is
> > > the ProtoBloomFilter that has the Builder. The ProtoBloomFilter allows us
> > > to build the hashes but not apply them to build the specific “shape”
> > > (definition of “shape” noted above). Thus one ProtoBloomFilter can be the
> > > progenitor of Bloom filters in multiple “systems” (defintion of “system”
> > > noted above). This also means that the 10K bucket example I noted above
> > can
> > > become much more performant because we can segregate the bucket and group
> > > Bloom filters into different “Systems” so that we get better collision
> > > avoidance at the group level than we would otherwise. (I can provide data
> > > and example if you wish.)
> > >
> >
> > Still lost; examples may help.
> > My impression is that we could come to that after deciding
> > whether single inheritance will fit the bill or not.
> >
> >
> Ok. Let’s put this aside until we get the other worked out, or until it
> becomes critical for getting the othe worked out.

After the "BitSet" unit tests which I suggested, I think that
the next step would be the "Shape" interface and from there
"BloomFilterConfiguration" (which, I hope, could be more
self-documenting).

> > >> [...]
> > >
> > > I agree and hope this response has clarified and answered.
> >
> > Not totally. :-}
> >
>
> Let's keep working then.

The use-cases which you have posted are very useful; thanks.

> > >> [...]
> >
> > I still think that talking about serialization does not help at all.
> > An example (a unit test maybe) of what should be possible
> > in the distributed architecture which you evoked would help
> > fix ideas.
> >
> >
> Perhaps we should be clear that we are talking about little “s”
> serialization

Agreed but even so, it depends on practical use-cases requirements.
E.g. for the very large filters (ADN) case, I seem to understand that
whatever the serialization, it would not be efficient to pass around filters
between machines.

> not the baggage that comes with “Serializable”. Yes, unit
> tests would probably help.
>
>
> > >> >
> > >> > [1]
> > https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
> > > [...]
> >
> > To be pragmatic, I wonder whether I should create a branch on
> > within the Apache repository, so that all (including me) can
> > contribute and actually test the various alternatives...
> >
> >
> Is there a space in which to put non-code documents?  Things like the Use
> cases and perhaps a design doc that describes the decisions we have made
> and why.

Some projects have created a "userguide".  See e.g.:
    http://commons.apache.org/proper/commons-rng/userguide/rng.html

In that case, the HTML was auto-generated from a file in APT format:
    https://gitbox.apache.org/repos/asf?p=commons-rng.git;a=blob;f=src/site/apt/userguide/rng.apt

[Other formats are available and can be mixed.]

> I am going to create a branch in my repository and test the design to see
> if what I envision makes sense.

Please take into account that I (and others?) have zero practical
knowledge of this functionality.
Reviewing many classes at once is extremely time-consuming;
better IMO to start from the ground up and serialize (!) the issues.

> One final issue. There was recently a post on the pull request asking why
> the Bloom filter is an immutable object. We should probably explore why or
> why not.

Part of the answer is universal: because it is trivial to ensure a consistent
state (even in the face of concurrent usage).
The other part of is the justification, which you gave on GitHub, based on
expected usage.

> The biggest argument against is that the merge() method would requires that
> the data be duplicated and it makes it hard to do in place updates. For
> example when updating a “bucket” filter (as defined in the Use cases).

Now that "BloomFilter" is going to be a interface, some of its implementations
could be mutable (e.g. if it would be a requirement for its target usage).

However, arguments about performance should be backed with benchmarks...

In general, I think that code will be more robust if low-level classes are
immutable; and higher-level is responsible for managing instances (i.e.
replacing them if outdated).

> But if it is not immutable then the hashCode value changes making it
> unsuitable for Hash sets. I think that there are classes in Commons that
> have similar restrictions, and I think that documenting the issue in the
> class javadoc as well as the hashCode() and equals() (assuming we actually
> override them) would be acceptable.

Are there use cases where Bloom filters would be stored in a "HashSet"?

Best regards,
Gilles

>
> Thank you,
> Claude

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message