commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: [collections] BloomFilter package architecture discussion
Date Sat, 19 Oct 2019 14:20:12 GMT
On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski <gilleseran@gmail.com>
wrote:

> Hi.
>
> >>> [...]
> > >
> > > Maybe I was not clear enough: I'm not saying that we should prefer
> > > some representation (of the state) over another; only that the how
> > > the state is represented externally need not be the same as the
> internal
> > > representation.
> > >
> > > But if the state is fully encapsulated by a "BitSet", then internal and
> > > external can be the same, making easier both for developers and for
> > > users.  So let's have indeed a "getBitSet()" that returns a copy of the
> > > (private final) instance field (of type "BitSet") used to represent the
> > > state of a "BloomFilter".
> > >
> > >
> > I agree with most of this statement. However, I think the use of BitSet
> for
> > the internal representation is questionable.
>
> Hence, that settles the issue of "class" versus "interface": "BloomFilter"
> must be an interface.
>
> *I am still not convinced. How do you then provide consistent
> implementations of the distance functions? We could use “default”
> implementations but that seems like a misuse of the intent for “default”,
> though perhaps I misunderstand that.*
>
>
> [snip]

> > In addition, there are researchers that do not use
> > BitSet because it is too slow or too bloated.
>
> Important point.
> [I thought that "BitSet" was evoked because it perfectly fitted the bill.]
>

BitSet fits the bill in terms of required data read functionality.

[snip]

> > We can use the
> > Implementation in our classes and use the interface where possible but it
> > leaves open the possibility of switching in a version of RoaringBitmap or
> > similar without changing the external API. It is this interface that is
> the
> > “external reference implementation of state”. I think I would like to
> call
> > this interface BitSetI and will refer to it this way later. I think it
> > should contain the following methods as defined in BitSet: and(),
> andNot(),
> > cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
> > nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(),
> xor().
>
> IMHO, this proposal orthogonal to the "BloomFilter" functionality, and
> we must discuss separately whether [Collections] wants to host such an
> API.
>

I think a core question needs to be answered: Is a Bloom filter an
extension of functionality on a BitSet (isA) or does it simply use the
functionality of a BitSet (hasA). If find myself of two minds on this
quesiton at the start, but seem to come down on the isA side after analysis
of how the Bloom filter is used. To that end BloomFilter should provide the
data read methods of the BitSet functionality. What I called BitSetI
before.

>
> A priori, how the classes that implements the "BloomFilter" interface
> represents the "set of bits" is an implementation detail.  Even that there
> is a "set of bits" is a detail to *users* of the interface.
> Of course, a practical question is whether the representation should be
> "pluggable".  If not, we could change to using one or another from those
> classes that you mentioned (and change it from version to version).
> In any case, I think that this would be a feature to be left for later (and
> that could be added in a compatible way, if nothing of the "state" is
> leaked
> into the public API).
>

To my mind the “set of bits” interface is what I called BitSetI. Again,
this defines how the Bloom filter presents itself as a “set of bits” not
how it implements it.

[snip]

> > The implementation may not want to use the BitSet class because the
> filter
> > itself can be large. For example in DNA matching it is not uncommon for
> the
> > Bloom filter to be 100s of MB in size. This also means we don’t want to
> > make copies if we don’t have to. (perhaps I should add DNA matching to
> the
> > Use cases to capture the large number of bits).
>
> So this use case would entail that the "BloomFilter" interface must *not*
> mandate a "toBitSet()" method.
>

Agreed.


> > > [...]
> > >
> >
> > I agree with most of this statement. I still don’t like forcing a
> standard
> > BitSet implementation onto all instances of Bloom filter
>
> +1
>
> > when it is the
> > logical construct (BitSetI) that we are looking for.
>
> -1
> [IIUC; see my comments above.]
>

Somehow I think we are in violent agreement here, but perhaps are talking
past each other some how.


> > >
> > > > There are several other examples such as layered Bloom
> > > > filters (
> > > https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
> > > > and attenuated Bloom filters (
> > > > https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
> > > where
> > > > the internal structure deviates from the BitSet model but that still
> need
> > > > to present themselves as AbstractBloomFilters do. I think that
> Abstract
> > > > class is the correct pattern here.
> > >
> > > Not sure: "abstract" is adequate when several subclasses need to
> > > override behaviours; whereas, here the behaviour is the same but
> > > the state is expanded.
> > >
> > >
> > I think abstract because the methods to perform the calculations that are
> > currently in BloomFilterFunctions would be implementable against BitSetI.
> > Those are the functions that go into the abstract class (along with
> BitSetI
> > methods).
>
> Ditto: these are details from the POV of "BloomFilter" users; of course,
> we need code that will do the work on the whatever represents the "sets
> of bits", but at first sight I don't think that we need to advertise a
> "BitSetI"
> (even a "protected" one) in an abstract class, because it's unlikely that
> any and all existing external implementations of "set of bits" will comply
> with an interface defined here.
>
> Reading through various implementations of Bloom filters as well as papers
describeing new types of Bloom filters and ways to index them and ways to
extract data from them there are a set of low level bit operations (xor,
or, cardinality, and, cardinality of the results of xor, or and and) that
if supported make all that work possible. I will create a separate document
that outlines what those are so that we can discuss them in detail.

 [snip]

>
> > I don’t know of any such usage, however, nesting Bloom filters is done in
> > Attenuated filters and Layered Bloom filters use a number of Bloom filter
> > implementations internally.
>
> What do those use cases require from the "BloomFilter" interface?
>

What those implementations require are the standard match() and merge() as
well as some love level bit operations (see aboce). I will add this
analysis to the separate document I committed to above.


> > > > [...]
> > >
> > Number of bits is not implicitly conveyed by getBitSet() as the BitSet
> will
> > only grow to the size needed to capture the highest bit currently on. So
> > the number of bits may be 72 but the highest on bit might be 50.
>
> OK; doesn't matter now since a common "BitSet" state has been
> ruled out.
> I get that "numberOfBits()" is one of the methods defined by
> the "BloomFilter" interface.
>
> >
> > Hash, number of functions (k), number of items (n), and number of bits
> (m)
> > define the system in which the filter is expressed, call this the
> “Shape”.
> > It is not always necessary to convey the Shape when conveying the bloom
> > filter.
>
> Well, if it is *ever* necessary, then "BloomFilter" must declare a
> "shape()"
> method.  Hence, it could be worth defining a "Shape" subinterface (?).
>
>
See other discussion thread on shape.

[snip]

> >
> > > I'm lost here; I don't know what is "the" reference state, since a
> > > single structure cannot for example represent both a standard
> > > filter and a counting filter...
> > >
> > >
> > Think BitSetI.
>
> Still: No.
> IMHO. ;-)
>
> I meant that the bit set functions defined by BitSetI define the state of
a standard Bloom filter. Since all Bloom filter types are and
implementation of the concept of Bloom filter, all of them must be able to
provide the data requested from the BitSetI functionality.



> IIUC, for a Bloom filter,
>     "state" = "shape" + "set of bits"
>
> *See separate shape discussion.*


> > > >
> > > >> > In addition, the abstract Bloom filter might implement the “data
> > > >> > read” methods from BitSet directly. This would allow more
> efficient
> > > >> > implementation of several methods that currently depend upon
the
> > > >> > creation
> > > >> > of a BitSet and the manipulation of same.
> > > >>
> > > >> Sure; if any and all variants of a "BloomFilter" implementation can
> > > >> be created from a "BitSet", then such a factory method could be
> > > >> part of the "Builder" interface:
> > > >>
> > > >> public interface BloomFilter {
> > > >>     // Core methods...
> > > >>
> > > >>     public interface Builder {
> > > >>         public Builder from(BitSet s); // Factory method.
>
> As above, for large filters, this probably is not a good requirement.
> Should it be
>    public Builder from(BooleanStream s);
> ?
> ["BooleanStream" is not provided by the JDK...]
>

BooleanStream is not but IntStream is, in this case we define the integer
stream to be the indexes of the bits that are on in the filter.  I have
seen this in some implementations.


> > > >>         public BloomFilter build();
> > > >>         // Other builder methods.
> > > >>     }
> > > >> }
> > > >>
>

[snip]


> > One final issue. There was recently a post on the pull request asking why
> > the Bloom filter is an immutable object. We should probably explore why
> or
> > why not.
>
> Part of the answer is universal: because it is trivial to ensure a
> consistent
> state (even in the face of concurrent usage).
> The other part of is the justification, which you gave on GitHub, based on
> expected usage.
>
> > The biggest argument against is that the merge() method would requires
> that
> > the data be duplicated and it makes it hard to do in place updates. For
> > example when updating a “bucket” filter (as defined in the Use cases).
>
> Now that "BloomFilter" is going to be a interface, some of its
> implementations
> could be mutable (e.g. if it would be a requirement for its target usage).
>
> However, arguments about performance should be backed with benchmarks...
>
> In general, I think that code will be more robust if low-level classes are
> immutable; and higher-level is responsible for managing instances (i.e.
> replacing them if outdated).
>
> > But if it is not immutable then the hashCode value changes making it
> > unsuitable for Hash sets. I think that there are classes in Commons that
> > have similar restrictions, and I think that documenting the issue in the
> > class javadoc as well as the hashCode() and equals() (assuming we
> actually
> > override them) would be acceptable.
>
> Are there use cases where Bloom filters would be stored in a "HashSet"?
>
>
I had a use case in an index implementation but it can be worked around.
Having the filter be mutable seem to be a common implementation. I am
leaning in this direction now because immutability makes some usage more
compex without a strong benefit. But I can go either way.

All the best,
Claude
-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message