commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject [collections] Bloom filter - Discussion of Shape
Date Fri, 18 Oct 2019 15:49:34 GMT
 I think the other discussion is getting a bit long so I thought we could
start this discussion here and see if we can close out the other discussion
with agreement on the remaining topics.

The “Shape” of a bloom filter (excluding the hash algo) is defined
mathematically by

Number of Items (AKA: n) n = ceil(m / (-k / log(1 - exp(log(p) / k))))

Probability of Collision (AKA: p) p = (1 – exp(-kn/m))^k

Number of Bits (AKA: m) m = ceil((n * log(p)) / log(1 / pow(2, log(2))))

Number of Functions (AKA: k) k = round((m / n) * log(2))

Since the probability (p) is actually calculated directly from the other 3
the shape is actually concretely defined by n, m, and k.

This is all in the BloomFilterConfiguration class.

The BloomFilterConfigurationClass has 4 constrcutors (n,p), (p,m,k), (n,m),
and (n,m,k). in all cases where is an "p" it is assumed to be a desired
value and values of the other missing properties are calculated. The actual
value for p is then recalculated from n, m and k.

Now it makes sense that the hash algo should also be specified I am happy
to assume an enum here for the moment and also assume a mechanism for
resolving the enum to a concrete hash implementation.

I think that you will find that BloomFilterConfiguration fits the bill but
perhaps should be renamed to BloomFilterShape.



There is one other component that we have not addressed lately: The
ProtoBloomFilter.

The ProtoBloomFilter addresses an issue that arises when there are filters
with different “shape” but same Hashing Algo.

Since hashing is the expensive operation, we want to try to cache the hash
values and reuse them rather than recalculate them.

The ProtoBloomFilter contains the results of the hashing in a format that
is easily used to create filters of the specified shape.

The current implementation uses a circular method whereby a murmur128 hash
is generated for a parameter to a with() call (e.g. with(“foo”) generates a
murmur128 hash of “foo”)
the hash  is split into 2 longs.

Now when the shape has a “k” of:
  1 we return the first long.
  2 we add the second long to the first long and return that.
  3 we add the second long the the result above and return that.

This has been shown to be as randomly distributed as calling
  murmur128( “foo”, (seed1) )
  murmur128( “foo”, (seed2) )
  murmur128( “foo”, (seed3) )

and much faster.

We can do this with any implementation that produces 128 bit hashes.

Once we have the long value for the specific k we take the mod() of that
and the number of bits in the Bloom filter (m). For Bloom filters that have
small enough (m) we could use smaller integers and it would not have a
significant impact on the distribution of the bits.

There is an issue of whether the long value is considered signed or
unsigned as this significantly impacts the mod(). This is why I suggested a
string like Murmur128-UC or Murmur128-SC (unsigned or signed).

If instead the ProtoBloomFilter called the hash function directly then it
would be what I called an incremented call. So calling

  murmur128( “foo”, (seed1) )
  murmur128( “foo”, (seed2) )
  murmur128( “foo”, (seed3) )

inside the ProtoBloomFilter would yield a Murmur128-UI or Murmur128-SI hash
depending on whether the values were assumed to be signed or unsigned for
the mod() call.

Anyway, we can rename the ProtoBloomFilter to something else, but it
effectively generates the required number of functions (k) for each with(x)
parameter.

So you take the BloomFilterConfiguration (Shape) and the ProtoBloomFilter
(k provider) and you can build the BloomFilter.

There are some problems here. Almost none of the nomenclature is common. So
we can define as we go. I think there is a need for a proper hashing naming
were we can determine how the hash was created from the name (like the with
encryption algo names and similar to the RandomSource code you pointed me
to), but I have found no standards here.

Claude

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message