commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: [collections] BloomFilter package architecture discussion
Date Mon, 28 Oct 2019 17:26:36 GMT
Based on the feedback and questions I have created a different
implementation (https://github.com/Claudenw/BloomFilter/tree/Hasher) --
note this is the Hasher branch.

This implementation uses an abstract BloomFilter to implement the basic
functionality required and required that implementations implement 4
methods at a minimum:

getBits() -- retrieves a LongBuffer with the associated enabled bits turned
on.
getHasher() -- retrieves a StaticHasher that will iterate the index of the
enabled bits.
merge(BloomFilter) -- merges an abstract bloom filter into the current
bloom filter using one of the two methods above.
merge(Shape,Hasher) -- merges the bits from the hasher using the specified
shape into the filter.

Other methods are expected to be specialized for operation with Bloom
filters of the same type.

There are 4 Bloom filter implementations to demonstrate how this would be
done.  I do not expect the EWAH implementation to be included in an Apache
library as it depends on 3rd party code, it is here as an example.

BloomFilter class contains an enclosed  Shape class that describes the
shape of the BloomFilter.  This is very similar  to the Shape presented by
Gilles

There is a construct called a Hasher that comprises a hashName and a
ToLongBiFunction<ByteBuffer, Integer>.    The Hashing functions are moved
to DynamicHasher and result of hashing can be stored for reply later using
the StaticHasher.

There is a HasherFactory to build well known hashers.  The HasherFactory
also contains a description of how the ToLongBiFunction<ByteBuffer,
Integer> should function (i.e. what the parameters are) so that new
implementations can be created.  There are 4 implementations of hashing
functions included as examples.

Claude


On Sun, Oct 20, 2019 at 5:04 PM Gilles Sadowski <gilleseran@gmail.com>
wrote:

> Hi Claude.
>
> Le dim. 20 oct. 2019 à 14:20, Claude Warren <claude@xenei.com> a écrit :
> >
> > A Bloom filter is a set of bits:
>
> It is not, according to the quote here:
>
> > "The hash area is considered as N individual addressable bits, with
> > addresses 0 through N - 1. It is assumed that all bits in the hash area
> are
> > first set to 0. Next, each mes- sage in the set to be stored is hash
> coded
> > into a number of distinct bit addresses, say al, a2, ..., ad. Finally,
> all
> > d bits addressed by al through ad are set to 1." -- Burton Bloom,
> > "Space/Time Trade-offs in Hash Coding with Allowable Errors"[1]
>
> In effect, the functionality needs some structures (cf. "hash", "message",
> "addressable bits") to operate on, in order to provide its service (what I
> called the _user_ API).
> Either your contribution still confuses API with implementation, or I'm
> still totally missing what it tries to achieve (in terms of "service").
> My feeling is that we are stuck because of different programming habits.
> My proposal (to expand the contents of the "BloomFilter.java" file) should
> be seen as a practical exercise around the misunderstanding, that would
> hopefully lead us to a consensus.
>
> Regards,
> Gilles
>
> > Claude
> >
> > [1] http://crystall.uta.edu/~mcguigan/cse6350/papers/Bloom.pdf
> >
> > On Sat, Oct 19, 2019 at 11:40 PM Claude Warren <claude@xenei.com> wrote:
> >
> > > As an experiment I attempted to implement a complete BloomFilter (see
> > > BloomFilterI2 in https://issues.apache.org/jira/browse/COLLECTIONS-728
> )
> > > using only IntStream and merge( BloomFilter ).  It is possible, though
> > > probably not the fastest implementation.
> > >
> > > We could define something like:
> > >  ByteBuffer getBits()
> > >
> > > or a method that returns a stream of Boolean (this will be very long in
> > > processing).
> > > or a method that returns a stream of ByteBuffers (the concatenation of
> > > which is the entire set of bits)
> > > or a method that returns an IntStream (the concatenation of which is
> the
> > > entire set of bits).
> > >
> > > I think that most implementations will have some internal structure
> that
> > > can generate the above.  The wider the item in the stream the faster
> we can
> > > move the bits, the better the average performance will be.
> > >
> > >
> > > Claude
> > >
> > >
> > >
> > >
> > > On Sat, Oct 19, 2019 at 4:48 PM Gilles Sadowski <gilleseran@gmail.com>
> > > wrote:
> > >
> > >> Hello.
> > >>
> > >> 2019-10-19 16:20 UTC+02:00, Claude Warren <claude@xenei.com>:
> > >> > On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski <
> gilleseran@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Hi.
> > >> >>
> > >> >> >>> [...]
> > >> >> > >
> > >> >> > > Maybe I was not clear enough: I'm not saying that we
should
> prefer
> > >> >> > > some representation (of the state) over another; only
that the
> how
> > >> >> > > the state is represented externally need not be the
same as the
> > >> >> internal
> > >> >> > > representation.
> > >> >> > >
> > >> >> > > But if the state is fully encapsulated by a "BitSet",
then
> internal
> > >> >> > > and
> > >> >> > > external can be the same, making easier both for developers
> and for
> > >> >> > > users.  So let's have indeed a "getBitSet()" that returns
a
> copy of
> > >> >> > > the
> > >> >> > > (private final) instance field (of type "BitSet") used
to
> represent
> > >> >> > > the
> > >> >> > > state of a "BloomFilter".
> > >> >> > >
> > >> >> > >
> > >> >> > I agree with most of this statement. However, I think the
use of
> > >> BitSet
> > >> >> for
> > >> >> > the internal representation is questionable.
> > >> >>
> > >> >> Hence, that settles the issue of "class" versus "interface":
> > >> >> "BloomFilter"
> > >> >> must be an interface.
> > >> >>
> > >> >> *I am still not convinced. How do you then provide consistent
> > >> >> implementations of the distance functions? We could use “default”
> > >> >> implementations but that seems like a misuse of the intent for
> > >> “default”,
> > >> >> though perhaps I misunderstand that.*
> > >>
> > >> What's the issue with having "distance(BloomFilter other)"
> > >> in the interface?
> > >>
> > >> >>
> > >> >>
> > >> >> [snip]
> > >> >
> > >> >> > In addition, there are researchers that do not use
> > >> >> > BitSet because it is too slow or too bloated.
> > >> >>
> > >> >> Important point.
> > >> >> [I thought that "BitSet" was evoked because it perfectly fitted
the
> > >> >> bill.]
> > >> >>
> > >> >
> > >> > BitSet fits the bill in terms of required data read functionality.
> > >>
> > >> Yes; but not in terms of performance for certain use cases.
> > >> Hence why it cannot be a common denominator for all
> > >> implementations.
> > >>
> > >> >
> > >> > [snip]
> > >> >
> > >> >> > We can use the
> > >> >> > Implementation in our classes and use the interface where
> possible
> > >> but
> > >> >> > it
> > >> >> > leaves open the possibility of switching in a version of
> > >> RoaringBitmap
> > >> >> > or
> > >> >> > similar without changing the external API. It is this interface
> that
> > >> is
> > >> >> the
> > >> >> > “external reference implementation of state”. I think
I would
> like to
> > >> >> call
> > >> >> > this interface BitSetI and will refer to it this way later.
I
> think
> > >> it
> > >> >> > should contain the following methods as defined in BitSet:
and(),
> > >> >> andNot(),
> > >> >> > cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
> > >> >> > nextSetBit(), or(), previousClearBit(), previousSetBit(),
> stream(),
> > >> >> xor().
> > >> >>
> > >> >> IMHO, this proposal orthogonal to the "BloomFilter" functionality,
> and
> > >> >> we must discuss separately whether [Collections] wants to host
> such an
> > >> >> API.
> > >> >>
> > >> >
> > >> > I think a core question needs to be answered: Is a Bloom filter an
> > >> > extension of functionality on a BitSet (isA) or does it simply use
> the
> > >> > functionality of a BitSet (hasA).
> > >>
> > >> AFAICT, neither.
> > >>
> > >> > If find myself of two minds on this
> > >> > quesiton at the start, but seem to come down on the isA side after
> > >> analysis
> > >> > of how the Bloom filter is used. To that end BloomFilter should
> provide
> > >> the
> > >> > data read methods of the BitSet functionality. What I called BitSetI
> > >> > before.
> > >>
> > >> "data read" and "BitSetI" are quite different.
> > >> As said before, the former is implementation-dependent and a
> > >> user's business (as long as the developers provide a suitable
> > >> accessor of course).
> > >>
> > >> >>
> > >> >> A priori, how the classes that implements the "BloomFilter"
> interface
> > >> >> represents the "set of bits" is an implementation detail.  Even
> that
> > >> >> there
> > >> >> is a "set of bits" is a detail to *users* of the interface.
> > >> >> Of course, a practical question is whether the representation
> should be
> > >> >> "pluggable".  If not, we could change to using one or another
from
> > >> those
> > >> >> classes that you mentioned (and change it from version to version).
> > >> >> In any case, I think that this would be a feature to be left for
> later
> > >> >> (and
> > >> >> that could be added in a compatible way, if nothing of the "state"
> is
> > >> >> leaked
> > >> >> into the public API).
> > >> >>
> > >> >
> > >> > To my mind the “set of bits” interface is what I called BitSetI.
> Again,
> > >> > this defines how the Bloom filter presents itself as a “set of
> bits” not
> > >> > how it implements it.
> > >>
> > >> IMO, the "BitSetI"/"set of bits" is an implementation *detail*.
> > >> IOW: a "BloomFilter" is *not* a "set of bits" but *can* be represented
> > >> as one.
> > >>
> > >> Under the hood, it of course needs some kind of "set of bits" to
> perform
> > >> its operations (i.e. those methods that defines the public API), but
> that
> > >> certainly does not mean that an application developer would be well
> > >> advised to use a "BloomFilter" in order to represent, say, a number
> > >> (just because a number could also be viewed as a "set of bits")...
> > >>
> > >> >
> > >> > [snip]
> > >> >
> > >> >> > The implementation may not want to use the BitSet class because
> the
> > >> >> filter
> > >> >> > itself can be large. For example in DNA matching it is not
> uncommon
> > >> for
> > >> >> the
> > >> >> > Bloom filter to be 100s of MB in size. This also means we
don’t
> want
> > >> to
> > >> >> > make copies if we don’t have to. (perhaps I should add
DNA
> matching
> > >> to
> > >> >> the
> > >> >> > Use cases to capture the large number of bits).
> > >> >>
> > >> >> So this use case would entail that the "BloomFilter" interface
must
> > >> *not*
> > >> >> mandate a "toBitSet()" method.
> > >> >>
> > >> >
> > >> > Agreed.
> > >> >
> > >> >
> > >> >> > > [...]
> > >> >> > >
> > >> >> >
> > >> >> > I agree with most of this statement. I still don’t like
forcing a
> > >> >> standard
> > >> >> > BitSet implementation onto all instances of Bloom filter
> > >> >>
> > >> >> +1
> > >> >>
> > >> >> > when it is the
> > >> >> > logical construct (BitSetI) that we are looking for.
> > >> >>
> > >> >> -1
> > >> >> [IIUC; see my comments above.]
> > >> >>
> > >> >
> > >> > Somehow I think we are in violent agreement here, but perhaps are
> > >> talking
> > >> > past each other some how.
> > >>
> > >> Yes, and no.  We are indeed converging, but I more and more
> > >> convinced that the concept of "BloomFilter" should not be based
> > >> on how we think serialization should work or how to practically
> > >> implement the API.
> > >>
> > >> We must start from the "outside view" of the API, i.e. define what
> > >> services the "BloomFilter" must provide.
> > >>
> > >> >> > >
> > >> >> > > > There are several other examples such as layered
Bloom
> > >> >> > > > filters (
> > >> >> > >
> https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
> > >> >> > > > and attenuated Bloom filters (
> > >> >> > > >
> > >> https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
> > >> >> > > where
> > >> >> > > > the internal structure deviates from the BitSet
model but
> that
> > >> >> > > > still
> > >> >> need
> > >> >> > > > to present themselves as AbstractBloomFilters do.
I think
> that
> > >> >> Abstract
> > >> >> > > > class is the correct pattern here.
> > >> >> > >
> > >> >> > > Not sure: "abstract" is adequate when several subclasses
need
> to
> > >> >> > > override behaviours; whereas, here the behaviour is
the same
> but
> > >> >> > > the state is expanded.
> > >> >> > >
> > >> >> > >
> > >> >> > I think abstract because the methods to perform the calculations
> that
> > >> >> > are
> > >> >> > currently in BloomFilterFunctions would be implementable
against
> > >> >> > BitSetI.
> > >> >> > Those are the functions that go into the abstract class (along
> with
> > >> >> BitSetI
> > >> >> > methods).
> > >> >>
> > >> >> Ditto: these are details from the POV of "BloomFilter" users;
of
> > >> course,
> > >> >> we need code that will do the work on the whatever represents
the
> "sets
> > >> >> of bits", but at first sight I don't think that we need to
> advertise a
> > >> >> "BitSetI"
> > >> >> (even a "protected" one) in an abstract class, because it's
> unlikely
> > >> that
> > >> >> any and all existing external implementations of "set of bits"
will
> > >> >> comply
> > >> >> with an interface defined here.
> > >> >>
> > >> > Reading through various implementations of Bloom filters as well as
> > >> > papers
> > >> > describeing new types of Bloom filters and ways to index them and
> ways
> > >> to
> > >> > extract data from them there are a set of low level bit operations
> (xor,
> > >> > or, cardinality, and, cardinality of the results of xor, or and and)
> > >> that
> > >> > if supported make all that work possible. I will create a separate
> > >> document
> > >> > that outlines what those are so that we can discuss them in detail.
> > >>
> > >> You are certainly more aware than I am of all the "services" that
> > >> could be necessary in order to care for all the interesting use cases.
> > >> But how about focusing on the *minimal* API first?
> > >> What I suggest is to start from what you initially presented as the
> > >> "BloomFilter" interface ("match", "merge", etc.), and make that
> > >> work based on the layout ideas contained the preliminary Java code
> > >> posted on JIRA (issue COLLECTIONS-728).
> > >>
> > >> >
> > >> >  [snip]
> > >> >
> > >> >>
> > >> >> > I don’t know of any such usage, however, nesting Bloom
filters is
> > >> done
> > >> >> > in
> > >> >> > Attenuated filters and Layered Bloom filters use a number
of
> Bloom
> > >> >> > filter
> > >> >> > implementations internally.
> > >> >>
> > >> >> What do those use cases require from the "BloomFilter" interface?
> > >> >>
> > >> >
> > >> > What those implementations require are the standard match() and
> merge()
> > >> as
> > >> > well as some love level bit operations (see aboce). I will add this
> > >> > analysis to the separate document I committed to above.
> > >>
> > >> As per my preceding comment, I think that we should make the
> > >> basics as simple as simple, and show that it delivers the intended
> > >> functionality.
> > >> Then, more advanced use cases might well require addition to the
> > >> API, but it will be clear for everyone why the are needed (based on
> > >> said use-cases).
> > >> In effect, maybe some of the additional requirements could be
> > >> defined in an *extension* of the "BloomFilter" interface; a
> > >> "MutableBloomFilter" interface naturally comes to mind.
> > >> From what you said above, there might also be
> > >> ---CUT---
> > >> interface BloomFilterBitSetOperations extends BloomFilter {
> > >>     BloomFilterBitSetOperations and(BloomFilterBitSetOperations
> other);
> > >>     BloomFilterBitSetOperations xor(BloomFilterBitSetOperations
> other);
> > >> }
> > >> ---CUT---
> > >>
> > >> Some other functionalities (mostly the recurrent issue of
> serialization)
> > >> might be readily solved at the level of specific implementations; e.g.
> > >> the "BitSetBloomFilter" which I referred to before will certainly
> > >> provide "getBitSet()" and "fromBitSet(BitSet)" methods that are
> > >> sufficient for moving instances around.
> > >>
> > >> >> > > > [...]
> > >> >> > >
> > >> >> > Number of bits is not implicitly conveyed by getBitSet()
as the
> > >> BitSet
> > >> >> will
> > >> >> > only grow to the size needed to capture the highest bit
> currently on.
> > >> >> > So
> > >> >> > the number of bits may be 72 but the highest on bit might
be 50.
> > >> >>
> > >> >> OK; doesn't matter now since a common "BitSet" state has been
> > >> >> ruled out.
> > >> >> I get that "numberOfBits()" is one of the methods defined by
> > >> >> the "BloomFilter" interface.
> > >> >>
> > >> >> >
> > >> >> > Hash, number of functions (k), number of items (n), and number
of
> > >> bits
> > >> >> (m)
> > >> >> > define the system in which the filter is expressed, call
this the
> > >> >> “Shape”.
> > >> >> > It is not always necessary to convey the Shape when conveying
the
> > >> bloom
> > >> >> > filter.
> > >> >>
> > >> >> Well, if it is *ever* necessary, then "BloomFilter" must declare
a
> > >> >> "shape()"
> > >> >> method.  Hence, it could be worth defining a "Shape" subinterface
> (?).
> > >> >>
> > >> >>
> > >> > See other discussion thread on shape.
> > >>
> > >> I've used it to arrive to the "BloomFilter.java" proposal attached
> > >> on JIRA.
> > >>
> > >> >
> > >> > [snip]
> > >> >
> > >> >> >
> > >> >> > > I'm lost here; I don't know what is "the" reference
state,
> since a
> > >> >> > > single structure cannot for example represent both a
standard
> > >> >> > > filter and a counting filter...
> > >> >> > >
> > >> >> > >
> > >> >> > Think BitSetI.
> > >> >>
> > >> >> Still: No.
> > >> >> IMHO. ;-)
> > >> >>
> > >> > I meant that the bit set functions defined by BitSetI define the
> state
> > >> of
> > >> > a standard Bloom filter. Since all Bloom filter types are and
> > >> > implementation of the concept of Bloom filter, all of them must be
> able
> > >> to
> > >> > provide the data requested from the BitSetI functionality.
> > >>
> > >> I get the rationale, but I'm not able to see that in the textual
> > >> description of use cases.
> > >>
> > >> Please note that I wish to avoid "preemptive" bloating of the API,
> > >> which, after the fact, is nearly impossible to fix in "Commons"
> > >> codes.
> > >> [Among other problems, this ever increasing size of the codebase,
> > >> despite the identified issues, is what led to "Commons Math"
> > >> unfortunate fate: specific case in point in the matrix functionality
> > >> there, whose bloated interface discouraged potential contributions
> > >> targeting at 3D geometry.]
> > >>
> > >> >> IIUC, for a Bloom filter,
> > >> >>     "state" = "shape" + "set of bits"
> > >> >>
> > >> >> *See separate shape discussion.*
> > >> >
> > >> >
> > >> >> > > >
> > >> >> > > >> > In addition, the abstract Bloom filter
might implement the
> > >> “data
> > >> >> > > >> > read” methods from BitSet directly.
This would allow more
> > >> >> efficient
> > >> >> > > >> > implementation of several methods that
currently depend
> upon
> > >> the
> > >> >> > > >> > creation
> > >> >> > > >> > of a BitSet and the manipulation of same.
> > >> >> > > >>
> > >> >> > > >> Sure; if any and all variants of a "BloomFilter"
> implementation
> > >> >> > > >> can
> > >> >> > > >> be created from a "BitSet", then such a factory
method
> could be
> > >> >> > > >> part of the "Builder" interface:
> > >> >> > > >>
> > >> >> > > >> public interface BloomFilter {
> > >> >> > > >>     // Core methods...
> > >> >> > > >>
> > >> >> > > >>     public interface Builder {
> > >> >> > > >>         public Builder from(BitSet s); // Factory
method.
> > >> >>
> > >> >> As above, for large filters, this probably is not a good
> requirement.
> > >> >> Should it be
> > >> >>    public Builder from(BooleanStream s);
> > >> >> ?
> > >> >> ["BooleanStream" is not provided by the JDK...]
> > >> >>
> > >> >
> > >> > BooleanStream is not but IntStream is, in this case we define the
> > >> integer
> > >> > stream to be the indexes of the bits that are on in the filter.  I
> have
> > >> > seen this in some implementations.
> > >>
> > >> Maybe not the most efficient way to convey the state if more
> > >> than 1/32 of the bits are set.
> > >>
> > >> >> > > >>         public BloomFilter build();
> > >> >> > > >>         // Other builder methods.
> > >> >> > > >>     }
> > >> >> > > >> }
> > >> >> > > >>
> > >> >>
> > >> >
> > >> > [snip]
> > >> >
> > >> >
> > >> >> > One final issue. There was recently a post on the pull request
> asking
> > >> >> > why
> > >> >> > the Bloom filter is an immutable object. We should probably
> explore
> > >> why
> > >> >> or
> > >> >> > why not.
> > >> >>
> > >> >> Part of the answer is universal: because it is trivial to ensure
a
> > >> >> consistent
> > >> >> state (even in the face of concurrent usage).
> > >> >> The other part of is the justification, which you gave on GitHub,
> based
> > >> >> on
> > >> >> expected usage.
> > >> >>
> > >> >> > The biggest argument against is that the merge() method would
> > >> requires
> > >> >> that
> > >> >> > the data be duplicated and it makes it hard to do in place
> updates.
> > >> For
> > >> >> > example when updating a “bucket” filter (as defined in
the Use
> > >> cases).
> > >> >>
> > >> >> Now that "BloomFilter" is going to be a interface, some of its
> > >> >> implementations
> > >> >> could be mutable (e.g. if it would be a requirement for its target
> > >> >> usage).
> > >> >>
> > >> >> However, arguments about performance should be backed with
> > >> benchmarks...
> > >> >>
> > >> >> In general, I think that code will be more robust if low-level
> classes
> > >> >> are
> > >> >> immutable; and higher-level is responsible for managing instances
> (i.e.
> > >> >> replacing them if outdated).
> > >> >>
> > >> >> > But if it is not immutable then the hashCode value changes
> making it
> > >> >> > unsuitable for Hash sets. I think that there are classes
in
> Commons
> > >> >> > that
> > >> >> > have similar restrictions, and I think that documenting the
> issue in
> > >> >> > the
> > >> >> > class javadoc as well as the hashCode() and equals() (assuming
we
> > >> >> actually
> > >> >> > override them) would be acceptable.
> > >> >>
> > >> >> Are there use cases where Bloom filters would be stored in a
> "HashSet"?
> > >> >>
> > >> >>
> > >> > I had a use case in an index implementation but it can be worked
> around.
> > >> > Having the filter be mutable seem to be a common implementation. I
> am
> > >> > leaning in this direction now because immutability makes some usage
> more
> > >> > compex without a strong benefit.
> > >>
> > >> I'd tend to disagree, especially if the simple/basic implementations
> > >> have some uses in multi-threaded scenarios.
> > >> In particular, the copying of small objects will probably be more
> > >> efficient
> > >> than mutating them.
> > >>
> > >> > But I can go either way.
> > >>
> > >> With "BloomFilter" as an interface, we can go both ways: just add a
> > >> new implementation when a use-cases justifies it (through benchmarks).
> > >>
> > >> Best regards,
> > >> Gilles
> > >>
> > >> >
> > >> > All the best,
> > >> > Claude
> > >> >
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > >> For additional commands, e-mail: dev-help@commons.apache.org
> > >>
> > >>
> > >
> > > --
> > > I like: Like Like - The likeliest place on the web
> > > <http://like-like.xenei.com>
> > > LinkedIn: http://www.linkedin.com/in/claudewarren
> > >
> >
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message