commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Herbert (Jira)" <j...@apache.org>
Subject [jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
Date Fri, 25 Oct 2019 20:01:00 GMT

    [ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960025#comment-16960025
] 

Alex Herbert commented on COLLECTIONS-728:
------------------------------------------

Following on from a conversation had on GitHub for the initial BloomFilter code contribution
PR the example there was a Bloom filter created for 10,000 objects with a false positive probability
of 1/50000. This requires a Bloom filter of 225,200 bits with 16 bits set per object hash.
Thus even though the Bloom filter may require a lot of storage to represent the combination
of thousands of objects each object only requires a very small amount of storage. In addition
you only need to find a single bit index in the object hash that is not in the Bloom filter
to determine that there is no match.

Thus the common use case of checking if an object matches the filter is to compute a relatively
small set of indices and check each one is in the Bloom filter. This would suggest the following:
{code:java}
public interface BloomFilter {
    public interface BitIterator {
        /**
         * Get the next set bit, or -1 if no more bits are set.
         *
         * @return the bit index (or -1)
         */
        int nextSetBit();

        static BitIterator of(int... bits) {
            return new BitIterator() {
                int count;

                @Override
                public int nextSetBit() {
                    if (count < bits.length) {
                        return bits[count++];
                    }
                    return -1;
                }
            }
        }
    }

    default boolean match(BitIterator bits) {
        int index = bits.nextSetBit();
        if (index == -1) {
            // Cannot match no indices
            return false;
        }
        while (index != -1) {
            if (!get(index)) {
                return false;
            }
        }
        return true;
    }

    boolean get(int bitIndex);
}{code}
Thus the Bloom filter satisfies the following:
 # A BloomFilter is a representation of the logical OR (combination) of many hashes.
 # A hash is a specification of indices that are set to 1.
 # A BloomFilter can answer the question: is bit index _n_ set?
 # A BloomFilter can answer the question: are all bit indices _N_ set?
 # A "hasher" is able to convert an object to a set of indices.
 # A comparison of each index in turn opens the possibility of dynamic computation of each
hash index with fail fast matching.

To use a BloomFilter would require an implementation of hashing that generates a number of
indices for your objects and a concrete representation of the BloomFilter composite of many
hashes.

This is just an idea and I can see that if constant time access to the method {{get(int)}}
is not available then the filter would be slow. It is an alternative and can be combined with
the current idea to build an entire BloomFilter for each new object and perform a match of
two BloomFilters:
{code:java}
public interface BloomFilter {
    public interface BitIterator {
        /**
         * Get the next set bit, or -1 if no more bits are set.
         *
         * @return the bit index (or -1)
         */
        int nextSetBit();
    }

    boolean match(BloomFilter other);

    boolean match(BitIterator bits);
}
{code}
 

> BloomFilter contribution
> ------------------------
>
>                 Key: COLLECTIONS-728
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message