commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [commons-collections] Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
Date Thu, 24 Oct 2019 15:19:02 GMT
Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
URL: https://github.com/apache/commons-collections/pull/83#issuecomment-545967930
 
 
   Are you suggesting then that a Bloom filter should accept an iterator or
   stream of int and perform comparisons against that as well as the standard
   Bloom filter match calculation?
   
   
   On Thu, Oct 24, 2019 at 1:52 PM Alex Herbert <notifications@github.com>
   wrote:
   
   > If the 225,000 bits of a bloom filter are 3519 longs then you have 3519
   > ((filter & target) == target) operations with possible early exit:
   >
   > long[] filter = ...
   > long[] target = ...
   > for (int i = 0; i < filter.length; i++) {
   >     if ((filter[i] & target[i]) == target[i]) {
   >         return NO_MATCH;
   >     }
   > }
   > return MATCH;
   >
   > If the expected number of bits turned on is 16 this seems wasteful, but
   > I've not worked through the probability distribution of bits and the
   > expected number of loop executions.
   >
   > The key point is that we can skip all the zeros in the target. Since this:
   >
   > if ((filter[i] & target[i]) == target[i])
   >
   > is always true when the target is zero. It is only when the target is not
   > zero that a result is of interest.
   >
   > If you know that you only have 16 bits to compare you just isolate those:
   >
   > // Dynamically computed bits of the hash function
   > IntStream bits = ...
   >
   > boolean noMatch = bits.filter(bit -> {
   >     long value = filter[bit / 64];
   >     // Check the bit is not set
   >     return (value & (1L << (bit & 63))) == 0;
   > }).findAny().isPresent();
   >
   > I've used the streams API here but you could just compute each hash bit in
   > a loop:
   >
   > for (int i = 0; i < 16; i++) {
   >     // Dynamically computed bits of the hash function
   >     int bit = computeHash(i);
   >     long value = filter[bit / 64];
   >     // Check the bit is not set
   >     if (value & (1L << (bit & 63))) == 0) {
   >         return NO_MATCH;
   >     }
   > }
   > return MATCH;
   >
   > There are more operations per loop with the shifts used to isolate the
   > correct bit but only 16 possible loops.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/commons-collections/pull/83?email_source=notifications&email_token=AASTVHU7IUWHFZ3PWNHP2CLQQGLCPA5CNFSM4IWGTGA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECE5GNQ#issuecomment-545903414>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AASTVHXUNQMB2BSWCDKUPJLQQGLCPANCNFSM4IWGTGAQ>
   > .
   >
   
   
   -- 
   I like: Like Like - The likeliest place on the web
   <http://like-like.xenei.com>
   LinkedIn: http://www.linkedin.com/in/claudewarren
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

Mime
View raw message