commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: [collections] BloomFilter package architecture discussion
Date Sat, 19 Oct 2019 22:40:10 GMT
As an experiment I attempted to implement a complete BloomFilter (see
BloomFilterI2 in https://issues.apache.org/jira/browse/COLLECTIONS-728)
using only IntStream and merge( BloomFilter ).  It is possible, though
probably not the fastest implementation.

We could define something like:
 ByteBuffer getBits()

or a method that returns a stream of Boolean (this will be very long in
processing).
or a method that returns a stream of ByteBuffers (the concatenation of
which is the entire set of bits)
or a method that returns an IntStream (the concatenation of which is the
entire set of bits).

I think that most implementations will have some internal structure that
can generate the above.  The wider the item in the stream the faster we can
move the bits, the better the average performance will be.


Claude




On Sat, Oct 19, 2019 at 4:48 PM Gilles Sadowski <gilleseran@gmail.com>
wrote:

> Hello.
>
> 2019-10-19 16:20 UTC+02:00, Claude Warren <claude@xenei.com>:
> > On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski <gilleseran@gmail.com>
> > wrote:
> >
> >> Hi.
> >>
> >> >>> [...]
> >> > >
> >> > > Maybe I was not clear enough: I'm not saying that we should prefer
> >> > > some representation (of the state) over another; only that the how
> >> > > the state is represented externally need not be the same as the
> >> internal
> >> > > representation.
> >> > >
> >> > > But if the state is fully encapsulated by a "BitSet", then internal
> >> > > and
> >> > > external can be the same, making easier both for developers and for
> >> > > users.  So let's have indeed a "getBitSet()" that returns a copy of
> >> > > the
> >> > > (private final) instance field (of type "BitSet") used to represent
> >> > > the
> >> > > state of a "BloomFilter".
> >> > >
> >> > >
> >> > I agree with most of this statement. However, I think the use of
> BitSet
> >> for
> >> > the internal representation is questionable.
> >>
> >> Hence, that settles the issue of "class" versus "interface":
> >> "BloomFilter"
> >> must be an interface.
> >>
> >> *I am still not convinced. How do you then provide consistent
> >> implementations of the distance functions? We could use “default”
> >> implementations but that seems like a misuse of the intent for
> “default”,
> >> though perhaps I misunderstand that.*
>
> What's the issue with having "distance(BloomFilter other)"
> in the interface?
>
> >>
> >>
> >> [snip]
> >
> >> > In addition, there are researchers that do not use
> >> > BitSet because it is too slow or too bloated.
> >>
> >> Important point.
> >> [I thought that "BitSet" was evoked because it perfectly fitted the
> >> bill.]
> >>
> >
> > BitSet fits the bill in terms of required data read functionality.
>
> Yes; but not in terms of performance for certain use cases.
> Hence why it cannot be a common denominator for all
> implementations.
>
> >
> > [snip]
> >
> >> > We can use the
> >> > Implementation in our classes and use the interface where possible but
> >> > it
> >> > leaves open the possibility of switching in a version of RoaringBitmap
> >> > or
> >> > similar without changing the external API. It is this interface that
> is
> >> the
> >> > “external reference implementation of state”. I think I would like
to
> >> call
> >> > this interface BitSetI and will refer to it this way later. I think it
> >> > should contain the following methods as defined in BitSet: and(),
> >> andNot(),
> >> > cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
> >> > nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(),
> >> xor().
> >>
> >> IMHO, this proposal orthogonal to the "BloomFilter" functionality, and
> >> we must discuss separately whether [Collections] wants to host such an
> >> API.
> >>
> >
> > I think a core question needs to be answered: Is a Bloom filter an
> > extension of functionality on a BitSet (isA) or does it simply use the
> > functionality of a BitSet (hasA).
>
> AFAICT, neither.
>
> > If find myself of two minds on this
> > quesiton at the start, but seem to come down on the isA side after
> analysis
> > of how the Bloom filter is used. To that end BloomFilter should provide
> the
> > data read methods of the BitSet functionality. What I called BitSetI
> > before.
>
> "data read" and "BitSetI" are quite different.
> As said before, the former is implementation-dependent and a
> user's business (as long as the developers provide a suitable
> accessor of course).
>
> >>
> >> A priori, how the classes that implements the "BloomFilter" interface
> >> represents the "set of bits" is an implementation detail.  Even that
> >> there
> >> is a "set of bits" is a detail to *users* of the interface.
> >> Of course, a practical question is whether the representation should be
> >> "pluggable".  If not, we could change to using one or another from those
> >> classes that you mentioned (and change it from version to version).
> >> In any case, I think that this would be a feature to be left for later
> >> (and
> >> that could be added in a compatible way, if nothing of the "state" is
> >> leaked
> >> into the public API).
> >>
> >
> > To my mind the “set of bits” interface is what I called BitSetI. Again,
> > this defines how the Bloom filter presents itself as a “set of bits” not
> > how it implements it.
>
> IMO, the "BitSetI"/"set of bits" is an implementation *detail*.
> IOW: a "BloomFilter" is *not* a "set of bits" but *can* be represented
> as one.
>
> Under the hood, it of course needs some kind of "set of bits" to perform
> its operations (i.e. those methods that defines the public API), but that
> certainly does not mean that an application developer would be well
> advised to use a "BloomFilter" in order to represent, say, a number
> (just because a number could also be viewed as a "set of bits")...
>
> >
> > [snip]
> >
> >> > The implementation may not want to use the BitSet class because the
> >> filter
> >> > itself can be large. For example in DNA matching it is not uncommon
> for
> >> the
> >> > Bloom filter to be 100s of MB in size. This also means we don’t want
> to
> >> > make copies if we don’t have to. (perhaps I should add DNA matching to
> >> the
> >> > Use cases to capture the large number of bits).
> >>
> >> So this use case would entail that the "BloomFilter" interface must
> *not*
> >> mandate a "toBitSet()" method.
> >>
> >
> > Agreed.
> >
> >
> >> > > [...]
> >> > >
> >> >
> >> > I agree with most of this statement. I still don’t like forcing a
> >> standard
> >> > BitSet implementation onto all instances of Bloom filter
> >>
> >> +1
> >>
> >> > when it is the
> >> > logical construct (BitSetI) that we are looking for.
> >>
> >> -1
> >> [IIUC; see my comments above.]
> >>
> >
> > Somehow I think we are in violent agreement here, but perhaps are talking
> > past each other some how.
>
> Yes, and no.  We are indeed converging, but I more and more
> convinced that the concept of "BloomFilter" should not be based
> on how we think serialization should work or how to practically
> implement the API.
>
> We must start from the "outside view" of the API, i.e. define what
> services the "BloomFilter" must provide.
>
> >> > >
> >> > > > There are several other examples such as layered Bloom
> >> > > > filters (
> >> > > https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
> >> > > > and attenuated Bloom filters (
> >> > > >
> https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
> >> > > where
> >> > > > the internal structure deviates from the BitSet model but that
> >> > > > still
> >> need
> >> > > > to present themselves as AbstractBloomFilters do. I think that
> >> Abstract
> >> > > > class is the correct pattern here.
> >> > >
> >> > > Not sure: "abstract" is adequate when several subclasses need to
> >> > > override behaviours; whereas, here the behaviour is the same but
> >> > > the state is expanded.
> >> > >
> >> > >
> >> > I think abstract because the methods to perform the calculations that
> >> > are
> >> > currently in BloomFilterFunctions would be implementable against
> >> > BitSetI.
> >> > Those are the functions that go into the abstract class (along with
> >> BitSetI
> >> > methods).
> >>
> >> Ditto: these are details from the POV of "BloomFilter" users; of course,
> >> we need code that will do the work on the whatever represents the "sets
> >> of bits", but at first sight I don't think that we need to advertise a
> >> "BitSetI"
> >> (even a "protected" one) in an abstract class, because it's unlikely
> that
> >> any and all existing external implementations of "set of bits" will
> >> comply
> >> with an interface defined here.
> >>
> > Reading through various implementations of Bloom filters as well as
> > papers
> > describeing new types of Bloom filters and ways to index them and ways to
> > extract data from them there are a set of low level bit operations (xor,
> > or, cardinality, and, cardinality of the results of xor, or and and) that
> > if supported make all that work possible. I will create a separate
> document
> > that outlines what those are so that we can discuss them in detail.
>
> You are certainly more aware than I am of all the "services" that
> could be necessary in order to care for all the interesting use cases.
> But how about focusing on the *minimal* API first?
> What I suggest is to start from what you initially presented as the
> "BloomFilter" interface ("match", "merge", etc.), and make that
> work based on the layout ideas contained the preliminary Java code
> posted on JIRA (issue COLLECTIONS-728).
>
> >
> >  [snip]
> >
> >>
> >> > I don’t know of any such usage, however, nesting Bloom filters is done
> >> > in
> >> > Attenuated filters and Layered Bloom filters use a number of Bloom
> >> > filter
> >> > implementations internally.
> >>
> >> What do those use cases require from the "BloomFilter" interface?
> >>
> >
> > What those implementations require are the standard match() and merge()
> as
> > well as some love level bit operations (see aboce). I will add this
> > analysis to the separate document I committed to above.
>
> As per my preceding comment, I think that we should make the
> basics as simple as simple, and show that it delivers the intended
> functionality.
> Then, more advanced use cases might well require addition to the
> API, but it will be clear for everyone why the are needed (based on
> said use-cases).
> In effect, maybe some of the additional requirements could be
> defined in an *extension* of the "BloomFilter" interface; a
> "MutableBloomFilter" interface naturally comes to mind.
> From what you said above, there might also be
> ---CUT---
> interface BloomFilterBitSetOperations extends BloomFilter {
>     BloomFilterBitSetOperations and(BloomFilterBitSetOperations other);
>     BloomFilterBitSetOperations xor(BloomFilterBitSetOperations other);
> }
> ---CUT---
>
> Some other functionalities (mostly the recurrent issue of serialization)
> might be readily solved at the level of specific implementations; e.g.
> the "BitSetBloomFilter" which I referred to before will certainly
> provide "getBitSet()" and "fromBitSet(BitSet)" methods that are
> sufficient for moving instances around.
>
> >> > > > [...]
> >> > >
> >> > Number of bits is not implicitly conveyed by getBitSet() as the BitSet
> >> will
> >> > only grow to the size needed to capture the highest bit currently on.
> >> > So
> >> > the number of bits may be 72 but the highest on bit might be 50.
> >>
> >> OK; doesn't matter now since a common "BitSet" state has been
> >> ruled out.
> >> I get that "numberOfBits()" is one of the methods defined by
> >> the "BloomFilter" interface.
> >>
> >> >
> >> > Hash, number of functions (k), number of items (n), and number of bits
> >> (m)
> >> > define the system in which the filter is expressed, call this the
> >> “Shape”.
> >> > It is not always necessary to convey the Shape when conveying the
> bloom
> >> > filter.
> >>
> >> Well, if it is *ever* necessary, then "BloomFilter" must declare a
> >> "shape()"
> >> method.  Hence, it could be worth defining a "Shape" subinterface (?).
> >>
> >>
> > See other discussion thread on shape.
>
> I've used it to arrive to the "BloomFilter.java" proposal attached
> on JIRA.
>
> >
> > [snip]
> >
> >> >
> >> > > I'm lost here; I don't know what is "the" reference state, since a
> >> > > single structure cannot for example represent both a standard
> >> > > filter and a counting filter...
> >> > >
> >> > >
> >> > Think BitSetI.
> >>
> >> Still: No.
> >> IMHO. ;-)
> >>
> > I meant that the bit set functions defined by BitSetI define the state of
> > a standard Bloom filter. Since all Bloom filter types are and
> > implementation of the concept of Bloom filter, all of them must be able
> to
> > provide the data requested from the BitSetI functionality.
>
> I get the rationale, but I'm not able to see that in the textual
> description of use cases.
>
> Please note that I wish to avoid "preemptive" bloating of the API,
> which, after the fact, is nearly impossible to fix in "Commons"
> codes.
> [Among other problems, this ever increasing size of the codebase,
> despite the identified issues, is what led to "Commons Math"
> unfortunate fate: specific case in point in the matrix functionality
> there, whose bloated interface discouraged potential contributions
> targeting at 3D geometry.]
>
> >> IIUC, for a Bloom filter,
> >>     "state" = "shape" + "set of bits"
> >>
> >> *See separate shape discussion.*
> >
> >
> >> > > >
> >> > > >> > In addition, the abstract Bloom filter might implement
the
> “data
> >> > > >> > read” methods from BitSet directly. This would allow
more
> >> efficient
> >> > > >> > implementation of several methods that currently depend
upon
> the
> >> > > >> > creation
> >> > > >> > of a BitSet and the manipulation of same.
> >> > > >>
> >> > > >> Sure; if any and all variants of a "BloomFilter" implementation
> >> > > >> can
> >> > > >> be created from a "BitSet", then such a factory method could
be
> >> > > >> part of the "Builder" interface:
> >> > > >>
> >> > > >> public interface BloomFilter {
> >> > > >>     // Core methods...
> >> > > >>
> >> > > >>     public interface Builder {
> >> > > >>         public Builder from(BitSet s); // Factory method.
> >>
> >> As above, for large filters, this probably is not a good requirement.
> >> Should it be
> >>    public Builder from(BooleanStream s);
> >> ?
> >> ["BooleanStream" is not provided by the JDK...]
> >>
> >
> > BooleanStream is not but IntStream is, in this case we define the integer
> > stream to be the indexes of the bits that are on in the filter.  I have
> > seen this in some implementations.
>
> Maybe not the most efficient way to convey the state if more
> than 1/32 of the bits are set.
>
> >> > > >>         public BloomFilter build();
> >> > > >>         // Other builder methods.
> >> > > >>     }
> >> > > >> }
> >> > > >>
> >>
> >
> > [snip]
> >
> >
> >> > One final issue. There was recently a post on the pull request asking
> >> > why
> >> > the Bloom filter is an immutable object. We should probably explore
> why
> >> or
> >> > why not.
> >>
> >> Part of the answer is universal: because it is trivial to ensure a
> >> consistent
> >> state (even in the face of concurrent usage).
> >> The other part of is the justification, which you gave on GitHub, based
> >> on
> >> expected usage.
> >>
> >> > The biggest argument against is that the merge() method would requires
> >> that
> >> > the data be duplicated and it makes it hard to do in place updates.
> For
> >> > example when updating a “bucket” filter (as defined in the Use cases).
> >>
> >> Now that "BloomFilter" is going to be a interface, some of its
> >> implementations
> >> could be mutable (e.g. if it would be a requirement for its target
> >> usage).
> >>
> >> However, arguments about performance should be backed with benchmarks...
> >>
> >> In general, I think that code will be more robust if low-level classes
> >> are
> >> immutable; and higher-level is responsible for managing instances (i.e.
> >> replacing them if outdated).
> >>
> >> > But if it is not immutable then the hashCode value changes making it
> >> > unsuitable for Hash sets. I think that there are classes in Commons
> >> > that
> >> > have similar restrictions, and I think that documenting the issue in
> >> > the
> >> > class javadoc as well as the hashCode() and equals() (assuming we
> >> actually
> >> > override them) would be acceptable.
> >>
> >> Are there use cases where Bloom filters would be stored in a "HashSet"?
> >>
> >>
> > I had a use case in an index implementation but it can be worked around.
> > Having the filter be mutable seem to be a common implementation. I am
> > leaning in this direction now because immutability makes some usage more
> > compex without a strong benefit.
>
> I'd tend to disagree, especially if the simple/basic implementations
> have some uses in multi-threaded scenarios.
> In particular, the copying of small objects will probably be more efficient
> than mutating them.
>
> > But I can go either way.
>
> With "BloomFilter" as an interface, we can go both ways: just add a
> new implementation when a use-cases justifies it (through benchmarks).
>
> Best regards,
> Gilles
>
> >
> > All the best,
> > Claude
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message