commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: [collections] BloomFilter package architecture discussion
Date Sun, 20 Oct 2019 12:19:59 GMT
A Bloom filter is a set of bits:

"The hash area is considered as N individual addressable bits, with
addresses 0 through N - 1. It is assumed that all bits in the hash area are
first set to 0. Next, each mes- sage in the set to be stored is hash coded
into a number of distinct bit addresses, say al, a2, ..., ad. Finally, all
d bits addressed by al through ad are set to 1." -- Burton Bloom,
"Space/Time Trade-offs in Hash Coding with Allowable Errors"[1]

Claude

[1] http://crystal.uta.edu/~mcguigan/cse6350/papers/Bloom.pdf

On Sat, Oct 19, 2019 at 11:40 PM Claude Warren <claude@xenei.com> wrote:

> As an experiment I attempted to implement a complete BloomFilter (see
> BloomFilterI2 in https://issues.apache.org/jira/browse/COLLECTIONS-728)
> using only IntStream and merge( BloomFilter ).  It is possible, though
> probably not the fastest implementation.
>
> We could define something like:
>  ByteBuffer getBits()
>
> or a method that returns a stream of Boolean (this will be very long in
> processing).
> or a method that returns a stream of ByteBuffers (the concatenation of
> which is the entire set of bits)
> or a method that returns an IntStream (the concatenation of which is the
> entire set of bits).
>
> I think that most implementations will have some internal structure that
> can generate the above.  The wider the item in the stream the faster we can
> move the bits, the better the average performance will be.
>
>
> Claude
>
>
>
>
> On Sat, Oct 19, 2019 at 4:48 PM Gilles Sadowski <gilleseran@gmail.com>
> wrote:
>
>> Hello.
>>
>> 2019-10-19 16:20 UTC+02:00, Claude Warren <claude@xenei.com>:
>> > On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski <gilleseran@gmail.com>
>> > wrote:
>> >
>> >> Hi.
>> >>
>> >> >>> [...]
>> >> > >
>> >> > > Maybe I was not clear enough: I'm not saying that we should prefer
>> >> > > some representation (of the state) over another; only that the
how
>> >> > > the state is represented externally need not be the same as the
>> >> internal
>> >> > > representation.
>> >> > >
>> >> > > But if the state is fully encapsulated by a "BitSet", then internal
>> >> > > and
>> >> > > external can be the same, making easier both for developers and
for
>> >> > > users.  So let's have indeed a "getBitSet()" that returns a copy
of
>> >> > > the
>> >> > > (private final) instance field (of type "BitSet") used to represent
>> >> > > the
>> >> > > state of a "BloomFilter".
>> >> > >
>> >> > >
>> >> > I agree with most of this statement. However, I think the use of
>> BitSet
>> >> for
>> >> > the internal representation is questionable.
>> >>
>> >> Hence, that settles the issue of "class" versus "interface":
>> >> "BloomFilter"
>> >> must be an interface.
>> >>
>> >> *I am still not convinced. How do you then provide consistent
>> >> implementations of the distance functions? We could use “default”
>> >> implementations but that seems like a misuse of the intent for
>> “default”,
>> >> though perhaps I misunderstand that.*
>>
>> What's the issue with having "distance(BloomFilter other)"
>> in the interface?
>>
>> >>
>> >>
>> >> [snip]
>> >
>> >> > In addition, there are researchers that do not use
>> >> > BitSet because it is too slow or too bloated.
>> >>
>> >> Important point.
>> >> [I thought that "BitSet" was evoked because it perfectly fitted the
>> >> bill.]
>> >>
>> >
>> > BitSet fits the bill in terms of required data read functionality.
>>
>> Yes; but not in terms of performance for certain use cases.
>> Hence why it cannot be a common denominator for all
>> implementations.
>>
>> >
>> > [snip]
>> >
>> >> > We can use the
>> >> > Implementation in our classes and use the interface where possible
>> but
>> >> > it
>> >> > leaves open the possibility of switching in a version of
>> RoaringBitmap
>> >> > or
>> >> > similar without changing the external API. It is this interface that
>> is
>> >> the
>> >> > “external reference implementation of state”. I think I would like
to
>> >> call
>> >> > this interface BitSetI and will refer to it this way later. I think
>> it
>> >> > should contain the following methods as defined in BitSet: and(),
>> >> andNot(),
>> >> > cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
>> >> > nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(),
>> >> xor().
>> >>
>> >> IMHO, this proposal orthogonal to the "BloomFilter" functionality, and
>> >> we must discuss separately whether [Collections] wants to host such an
>> >> API.
>> >>
>> >
>> > I think a core question needs to be answered: Is a Bloom filter an
>> > extension of functionality on a BitSet (isA) or does it simply use the
>> > functionality of a BitSet (hasA).
>>
>> AFAICT, neither.
>>
>> > If find myself of two minds on this
>> > quesiton at the start, but seem to come down on the isA side after
>> analysis
>> > of how the Bloom filter is used. To that end BloomFilter should provide
>> the
>> > data read methods of the BitSet functionality. What I called BitSetI
>> > before.
>>
>> "data read" and "BitSetI" are quite different.
>> As said before, the former is implementation-dependent and a
>> user's business (as long as the developers provide a suitable
>> accessor of course).
>>
>> >>
>> >> A priori, how the classes that implements the "BloomFilter" interface
>> >> represents the "set of bits" is an implementation detail.  Even that
>> >> there
>> >> is a "set of bits" is a detail to *users* of the interface.
>> >> Of course, a practical question is whether the representation should be
>> >> "pluggable".  If not, we could change to using one or another from
>> those
>> >> classes that you mentioned (and change it from version to version).
>> >> In any case, I think that this would be a feature to be left for later
>> >> (and
>> >> that could be added in a compatible way, if nothing of the "state" is
>> >> leaked
>> >> into the public API).
>> >>
>> >
>> > To my mind the “set of bits” interface is what I called BitSetI. Again,
>> > this defines how the Bloom filter presents itself as a “set of bits” not
>> > how it implements it.
>>
>> IMO, the "BitSetI"/"set of bits" is an implementation *detail*.
>> IOW: a "BloomFilter" is *not* a "set of bits" but *can* be represented
>> as one.
>>
>> Under the hood, it of course needs some kind of "set of bits" to perform
>> its operations (i.e. those methods that defines the public API), but that
>> certainly does not mean that an application developer would be well
>> advised to use a "BloomFilter" in order to represent, say, a number
>> (just because a number could also be viewed as a "set of bits")...
>>
>> >
>> > [snip]
>> >
>> >> > The implementation may not want to use the BitSet class because the
>> >> filter
>> >> > itself can be large. For example in DNA matching it is not uncommon
>> for
>> >> the
>> >> > Bloom filter to be 100s of MB in size. This also means we don’t want
>> to
>> >> > make copies if we don’t have to. (perhaps I should add DNA matching
>> to
>> >> the
>> >> > Use cases to capture the large number of bits).
>> >>
>> >> So this use case would entail that the "BloomFilter" interface must
>> *not*
>> >> mandate a "toBitSet()" method.
>> >>
>> >
>> > Agreed.
>> >
>> >
>> >> > > [...]
>> >> > >
>> >> >
>> >> > I agree with most of this statement. I still don’t like forcing a
>> >> standard
>> >> > BitSet implementation onto all instances of Bloom filter
>> >>
>> >> +1
>> >>
>> >> > when it is the
>> >> > logical construct (BitSetI) that we are looking for.
>> >>
>> >> -1
>> >> [IIUC; see my comments above.]
>> >>
>> >
>> > Somehow I think we are in violent agreement here, but perhaps are
>> talking
>> > past each other some how.
>>
>> Yes, and no.  We are indeed converging, but I more and more
>> convinced that the concept of "BloomFilter" should not be based
>> on how we think serialization should work or how to practically
>> implement the API.
>>
>> We must start from the "outside view" of the API, i.e. define what
>> services the "BloomFilter" must provide.
>>
>> >> > >
>> >> > > > There are several other examples such as layered Bloom
>> >> > > > filters (
>> >> > > https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
>> >> > > > and attenuated Bloom filters (
>> >> > > >
>> https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
>> >> > > where
>> >> > > > the internal structure deviates from the BitSet model but
that
>> >> > > > still
>> >> need
>> >> > > > to present themselves as AbstractBloomFilters do. I think
that
>> >> Abstract
>> >> > > > class is the correct pattern here.
>> >> > >
>> >> > > Not sure: "abstract" is adequate when several subclasses need
to
>> >> > > override behaviours; whereas, here the behaviour is the same but
>> >> > > the state is expanded.
>> >> > >
>> >> > >
>> >> > I think abstract because the methods to perform the calculations that
>> >> > are
>> >> > currently in BloomFilterFunctions would be implementable against
>> >> > BitSetI.
>> >> > Those are the functions that go into the abstract class (along with
>> >> BitSetI
>> >> > methods).
>> >>
>> >> Ditto: these are details from the POV of "BloomFilter" users; of
>> course,
>> >> we need code that will do the work on the whatever represents the "sets
>> >> of bits", but at first sight I don't think that we need to advertise a
>> >> "BitSetI"
>> >> (even a "protected" one) in an abstract class, because it's unlikely
>> that
>> >> any and all existing external implementations of "set of bits" will
>> >> comply
>> >> with an interface defined here.
>> >>
>> > Reading through various implementations of Bloom filters as well as
>> > papers
>> > describeing new types of Bloom filters and ways to index them and ways
>> to
>> > extract data from them there are a set of low level bit operations (xor,
>> > or, cardinality, and, cardinality of the results of xor, or and and)
>> that
>> > if supported make all that work possible. I will create a separate
>> document
>> > that outlines what those are so that we can discuss them in detail.
>>
>> You are certainly more aware than I am of all the "services" that
>> could be necessary in order to care for all the interesting use cases.
>> But how about focusing on the *minimal* API first?
>> What I suggest is to start from what you initially presented as the
>> "BloomFilter" interface ("match", "merge", etc.), and make that
>> work based on the layout ideas contained the preliminary Java code
>> posted on JIRA (issue COLLECTIONS-728).
>>
>> >
>> >  [snip]
>> >
>> >>
>> >> > I don’t know of any such usage, however, nesting Bloom filters is
>> done
>> >> > in
>> >> > Attenuated filters and Layered Bloom filters use a number of Bloom
>> >> > filter
>> >> > implementations internally.
>> >>
>> >> What do those use cases require from the "BloomFilter" interface?
>> >>
>> >
>> > What those implementations require are the standard match() and merge()
>> as
>> > well as some love level bit operations (see aboce). I will add this
>> > analysis to the separate document I committed to above.
>>
>> As per my preceding comment, I think that we should make the
>> basics as simple as simple, and show that it delivers the intended
>> functionality.
>> Then, more advanced use cases might well require addition to the
>> API, but it will be clear for everyone why the are needed (based on
>> said use-cases).
>> In effect, maybe some of the additional requirements could be
>> defined in an *extension* of the "BloomFilter" interface; a
>> "MutableBloomFilter" interface naturally comes to mind.
>> From what you said above, there might also be
>> ---CUT---
>> interface BloomFilterBitSetOperations extends BloomFilter {
>>     BloomFilterBitSetOperations and(BloomFilterBitSetOperations other);
>>     BloomFilterBitSetOperations xor(BloomFilterBitSetOperations other);
>> }
>> ---CUT---
>>
>> Some other functionalities (mostly the recurrent issue of serialization)
>> might be readily solved at the level of specific implementations; e.g.
>> the "BitSetBloomFilter" which I referred to before will certainly
>> provide "getBitSet()" and "fromBitSet(BitSet)" methods that are
>> sufficient for moving instances around.
>>
>> >> > > > [...]
>> >> > >
>> >> > Number of bits is not implicitly conveyed by getBitSet() as the
>> BitSet
>> >> will
>> >> > only grow to the size needed to capture the highest bit currently on.
>> >> > So
>> >> > the number of bits may be 72 but the highest on bit might be 50.
>> >>
>> >> OK; doesn't matter now since a common "BitSet" state has been
>> >> ruled out.
>> >> I get that "numberOfBits()" is one of the methods defined by
>> >> the "BloomFilter" interface.
>> >>
>> >> >
>> >> > Hash, number of functions (k), number of items (n), and number of
>> bits
>> >> (m)
>> >> > define the system in which the filter is expressed, call this the
>> >> “Shape”.
>> >> > It is not always necessary to convey the Shape when conveying the
>> bloom
>> >> > filter.
>> >>
>> >> Well, if it is *ever* necessary, then "BloomFilter" must declare a
>> >> "shape()"
>> >> method.  Hence, it could be worth defining a "Shape" subinterface (?).
>> >>
>> >>
>> > See other discussion thread on shape.
>>
>> I've used it to arrive to the "BloomFilter.java" proposal attached
>> on JIRA.
>>
>> >
>> > [snip]
>> >
>> >> >
>> >> > > I'm lost here; I don't know what is "the" reference state, since
a
>> >> > > single structure cannot for example represent both a standard
>> >> > > filter and a counting filter...
>> >> > >
>> >> > >
>> >> > Think BitSetI.
>> >>
>> >> Still: No.
>> >> IMHO. ;-)
>> >>
>> > I meant that the bit set functions defined by BitSetI define the state
>> of
>> > a standard Bloom filter. Since all Bloom filter types are and
>> > implementation of the concept of Bloom filter, all of them must be able
>> to
>> > provide the data requested from the BitSetI functionality.
>>
>> I get the rationale, but I'm not able to see that in the textual
>> description of use cases.
>>
>> Please note that I wish to avoid "preemptive" bloating of the API,
>> which, after the fact, is nearly impossible to fix in "Commons"
>> codes.
>> [Among other problems, this ever increasing size of the codebase,
>> despite the identified issues, is what led to "Commons Math"
>> unfortunate fate: specific case in point in the matrix functionality
>> there, whose bloated interface discouraged potential contributions
>> targeting at 3D geometry.]
>>
>> >> IIUC, for a Bloom filter,
>> >>     "state" = "shape" + "set of bits"
>> >>
>> >> *See separate shape discussion.*
>> >
>> >
>> >> > > >
>> >> > > >> > In addition, the abstract Bloom filter might implement
the
>> “data
>> >> > > >> > read” methods from BitSet directly. This would
allow more
>> >> efficient
>> >> > > >> > implementation of several methods that currently
depend upon
>> the
>> >> > > >> > creation
>> >> > > >> > of a BitSet and the manipulation of same.
>> >> > > >>
>> >> > > >> Sure; if any and all variants of a "BloomFilter" implementation
>> >> > > >> can
>> >> > > >> be created from a "BitSet", then such a factory method
could be
>> >> > > >> part of the "Builder" interface:
>> >> > > >>
>> >> > > >> public interface BloomFilter {
>> >> > > >>     // Core methods...
>> >> > > >>
>> >> > > >>     public interface Builder {
>> >> > > >>         public Builder from(BitSet s); // Factory method.
>> >>
>> >> As above, for large filters, this probably is not a good requirement.
>> >> Should it be
>> >>    public Builder from(BooleanStream s);
>> >> ?
>> >> ["BooleanStream" is not provided by the JDK...]
>> >>
>> >
>> > BooleanStream is not but IntStream is, in this case we define the
>> integer
>> > stream to be the indexes of the bits that are on in the filter.  I have
>> > seen this in some implementations.
>>
>> Maybe not the most efficient way to convey the state if more
>> than 1/32 of the bits are set.
>>
>> >> > > >>         public BloomFilter build();
>> >> > > >>         // Other builder methods.
>> >> > > >>     }
>> >> > > >> }
>> >> > > >>
>> >>
>> >
>> > [snip]
>> >
>> >
>> >> > One final issue. There was recently a post on the pull request asking
>> >> > why
>> >> > the Bloom filter is an immutable object. We should probably explore
>> why
>> >> or
>> >> > why not.
>> >>
>> >> Part of the answer is universal: because it is trivial to ensure a
>> >> consistent
>> >> state (even in the face of concurrent usage).
>> >> The other part of is the justification, which you gave on GitHub, based
>> >> on
>> >> expected usage.
>> >>
>> >> > The biggest argument against is that the merge() method would
>> requires
>> >> that
>> >> > the data be duplicated and it makes it hard to do in place updates.
>> For
>> >> > example when updating a “bucket” filter (as defined in the Use
>> cases).
>> >>
>> >> Now that "BloomFilter" is going to be a interface, some of its
>> >> implementations
>> >> could be mutable (e.g. if it would be a requirement for its target
>> >> usage).
>> >>
>> >> However, arguments about performance should be backed with
>> benchmarks...
>> >>
>> >> In general, I think that code will be more robust if low-level classes
>> >> are
>> >> immutable; and higher-level is responsible for managing instances (i.e.
>> >> replacing them if outdated).
>> >>
>> >> > But if it is not immutable then the hashCode value changes making it
>> >> > unsuitable for Hash sets. I think that there are classes in Commons
>> >> > that
>> >> > have similar restrictions, and I think that documenting the issue in
>> >> > the
>> >> > class javadoc as well as the hashCode() and equals() (assuming we
>> >> actually
>> >> > override them) would be acceptable.
>> >>
>> >> Are there use cases where Bloom filters would be stored in a "HashSet"?
>> >>
>> >>
>> > I had a use case in an index implementation but it can be worked around.
>> > Having the filter be mutable seem to be a common implementation. I am
>> > leaning in this direction now because immutability makes some usage more
>> > compex without a strong benefit.
>>
>> I'd tend to disagree, especially if the simple/basic implementations
>> have some uses in multi-threaded scenarios.
>> In particular, the copying of small objects will probably be more
>> efficient
>> than mutating them.
>>
>> > But I can go either way.
>>
>> With "BloomFilter" as an interface, we can go both ways: just add a
>> new implementation when a use-cases justifies it (through benchmarks).
>>
>> Best regards,
>> Gilles
>>
>> >
>> > All the best,
>> > Claude
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message