commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [commons-collections] Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
Date Thu, 24 Oct 2019 10:46:59 GMT
Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
URL: https://github.com/apache/commons-collections/pull/83#issuecomment-545860977
 
 
   On Thu, Oct 24, 2019 at 11:32 AM Alex Herbert <notifications@github.com>
   wrote:
   
   > I think the point is:
   >
   > "either test or set k bits using positions chosen using hash functions.
   > ...
   > if any of the bits are not set, the element certainly does not exist."
   >
   > So you can do the comparison with the first hashing function, then the
   > next, etc, until you have processed all the hashing functions. At any point
   > if a match is not found you can stop thus allowing early exit of your
   > entire comparison.
   >
   > If the hashing function outputs a single bit position then I can see value
   > in partial computation of the entire combined hash.
   >
   > So this is a question of whether part of the entire hash can be computed
   > in a stage and used against the full bits of the Bloom filter. In your
   > example for a filter of length 225,000 bits built using 16 hashing
   > functions are any of these true:
   >
   >    1. Each hashing function identifies only 1 bit
   >
   > TRUE
   
   
   >
   >    1. Each hashing function identifies bits from a non-overlapping range
   >    of the 225,000 bits
   >
   >
   FALSE.  If I understand your statement correctly.  The hash functions can
   return the same bit.  In the worst case you would get the same bit for all
   the hash functions.  In reality you probably get somewhere between 12 and
   16 bits.
   
   >
   >
   > In the example below the hashing function identifies 2 bits. For target A
   > you only need to check the first bit to know that the object is not in the
   > filter. For target B you have to check both.
   >
   > Bloom filter:
   > -----------1--------1------11---------1---------------11---------------1---
   >
   > Target A (not present):
   > --1-----------------------------------------------------------------------1
   >
   > Target B (may be present):
   > -----------1--------------------------1------------------------------------
   >
   >
   Yes you can perform the check as you indicate, but I maintain it will take
   longer to evaluate.
   
   Claude
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

Mime
View raw message