commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles Sadowski <gillese...@gmail.com>
Subject Re: [collections] Bloom filter - Discussion of Shape
Date Sat, 19 Oct 2019 02:37:51 GMT
Hi.

Le ven. 18 oct. 2019 à 17:55, Claude Warren <claude@xenei.com> a écrit :
>
>  I think the other discussion is getting a bit long so I thought we could
> start this discussion here and see if we can close out the other discussion
> with agreement on the remaining topics.
>
> The “Shape” of a bloom filter (excluding the hash algo) is defined
> mathematically by
>
> Number of Items (AKA: n) n = ceil(m / (-k / log(1 - exp(log(p) / k))))
>
> Probability of Collision (AKA: p) p = (1 – exp(-kn/m))^k
>
> Number of Bits (AKA: m) m = ceil((n * log(p)) / log(1 / pow(2, log(2))))
>
> Number of Functions (AKA: k) k = round((m / n) * log(2))
>
> Since the probability (p) is actually calculated directly from the other 3
> the shape is actually concretely defined by n, m, and k.
>
> This is all in the BloomFilterConfiguration class.
>
> The BloomFilterConfigurationClass has 4 constrcutors (n,p), (p,m,k), (n,m),
> and (n,m,k). in all cases where is an "p" it is assumed to be a desired
> value and values of the other missing properties are calculated. The actual
> value for p is then recalculated from n, m and k.
>
> Now it makes sense that the hash algo should also be specified I am happy
> to assume an enum here for the moment and also assume a mechanism for
> resolving the enum to a concrete hash implementation.
>
> I think that you will find that BloomFilterConfiguration fits the bill but
> perhaps should be renamed to BloomFilterShape.
>
>
>
> There is one other component that we have not addressed lately: The
> ProtoBloomFilter.
>
> The ProtoBloomFilter addresses an issue that arises when there are filters
> with different “shape” but same Hashing Algo.
>
> Since hashing is the expensive operation, we want to try to cache the hash
> values and reuse them rather than recalculate them.
>
> The ProtoBloomFilter contains the results of the hashing in a format that
> is easily used to create filters of the specified shape.
>
> The current implementation uses a circular method whereby a murmur128 hash
> is generated for a parameter to a with() call (e.g. with(“foo”) generates a
> murmur128 hash of “foo”)
> the hash  is split into 2 longs.
>
> Now when the shape has a “k” of:
>   1 we return the first long.
>   2 we add the second long to the first long and return that.
>   3 we add the second long the the result above and return that.
>
> This has been shown to be as randomly distributed as calling
>   murmur128( “foo”, (seed1) )
>   murmur128( “foo”, (seed2) )
>   murmur128( “foo”, (seed3) )
>
> and much faster.
>
> We can do this with any implementation that produces 128 bit hashes.
>
> Once we have the long value for the specific k we take the mod() of that
> and the number of bits in the Bloom filter (m). For Bloom filters that have
> small enough (m) we could use smaller integers and it would not have a
> significant impact on the distribution of the bits.
>
> There is an issue of whether the long value is considered signed or
> unsigned as this significantly impacts the mod(). This is why I suggested a
> string like Murmur128-UC or Murmur128-SC (unsigned or signed).
>
> If instead the ProtoBloomFilter called the hash function directly then it
> would be what I called an incremented call. So calling
>
>   murmur128( “foo”, (seed1) )
>   murmur128( “foo”, (seed2) )
>   murmur128( “foo”, (seed3) )
>
> inside the ProtoBloomFilter would yield a Murmur128-UI or Murmur128-SI hash
> depending on whether the values were assumed to be signed or unsigned for
> the mod() call.
>
> Anyway, we can rename the ProtoBloomFilter to something else, but it
> effectively generates the required number of functions (k) for each with(x)
> parameter.
>
> So you take the BloomFilterConfiguration (Shape) and the ProtoBloomFilter
> (k provider) and you can build the BloomFilter.
>
> There are some problems here. Almost none of the nomenclature is common. So
> we can define as we go. I think there is a need for a proper hashing naming
> were we can determine how the hash was created from the name (like the with
> encryption algo names and similar to the RandomSource code you pointed me
> to), but I have found no standards here.

At the issue page
    https://issues.apache.org/jira/browse/COLLECTIONS-728
I've attached a design suggestion from what I gathered from this discussion.
[It's incomplete and not tested, some names were changed (TBD), and the
commented out part ("HashSource") will not compile until more code is added.]

Prominent feature is the explicit namespace that groups in a single file all the
boiler-plate functionality needed by implementations of the
"BloomFilter" interface.
[This is also in line with the layout of the top-level package of
[Collections] that
contains various interfaces, each with their implementations located
in a specific
sub-package.]

Next step would be to create a
---CUT---
public class BitSetBloomFilter implements BloomFilter {
    private final BitSet state;
    // ...
}
---CUT---

WDYT?

Regards,
Gilles

>
> Claude
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message