commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gary Gregory <garydgreg...@gmail.com>
Subject Re: New Sub-project Proposal.
Date Wed, 11 Sep 2019 20:01:16 GMT
On Wed, Sep 11, 2019 at 11:06 AM Claude Warren <claude@xenei.com> wrote:

> First it is important to remember that Bloom filters tell you where things
> are NOT.  Second it is important to understand that Bloom filters can give
> false positives but never false negatives.  Seems kind of pointless I know
> but consider the case where you have 10K buckets that may contain the item
> you are looking for.  If you can reduce the number of buckets you are
> searching you can significantly speed up the search.  In a case like this a
> bloom filter could be used "in front" of each bucket as a gatekeeper.  When
> ever an object goes in the bucket the objects bloom filter is added to the
> bucket bloom filter.  If you want to search the 10K buckets for an item
> then you build the  bloom filter for the item you are looking for and check
> the bloom filter on each bucket.  If the filter says that the item is not
> in the bucket then you can skip that bucket, if the filter says it is in
> the bucket you search the bucket to verify that it is not a false
> positive.  A common use for bloom filters is to determine if an expensive
> call should be made.  For example many browsers have a bloom filter that
> comprises all the known bad URLs (ones that serve malware, etc).  When the
> URL is entered in the browser it is checked against the bloom filter.  If
> it is not there the request goes through as normal.  If it is there then
> the browser makes the expensive lookup call to a server to determine if the
> URL really is in the database of bad URLs.
>
> So a bloom filter is generally used to front a collection to determine if
> the collection should be searched.  And as has been pointed out it doesn't
> make much sense to use it in front of an in-memory hash table.  However,
> applications like Cassandra and Hadoop use bloom filters for various
> reasons.  I have recently been made aware of an Apache Incubator project
> that wants to implement bloom filters as part of their project.  Other uses
> for bloom filters include sharding data.  There is a measure of difference
> between filters called a hamming distance.  This is the number of bits that
> have to be "flipped" to turn one filter into another, and is very similar
> to Hamming measures found in string and other similar comparisons.  By
> using the hamming value it is possible to distribute data among a set of
> buckets by simply putting the value in the bucket that it is "closest" to
> in terms of Hamming distance.  Searcing takes place as noted above.
> However this has some interesting properties.  For example you can add new
> buckets at any time simply by adding an empty bucket and bloom filter to
> the collection of buckets and the system will start filling the bucket as
> appropriate.  In addition if a bucket/shard becomes "full", where "full" is
> an implementation dependent decision (e.g. the index on a DB table reaches
> the inflection point where performance degradation begins), you can pull a
> bucket out of consideration for inserts but still search it without
> significant stress or change to the system.
>
> Internally Bloom filters are bit vectors.  The length of the vector being
> determined by the number of items that are to be placed in the bucket and
> the acceptable hash collision rate.  There is a function that will
> calculate the length of the vector and the number of  functions to use to
> turn on the bits.[1]  In general you build a bloom filter by creating a
> hash and using the modulus of that to determine which bit in the vector to
> turn on.  You then furn a second hash, usually the same hash function with
> a different seed to determine the next bit and so on until the number of
> functions has been executed.  Importantly, there are comments tin the
> Cassandra code that describe a new and much faster way of doing this using
> 128-bit hashes and splitting them into a pair of longs.  To check
> membership in a bloom filter you buid the bloom-filter for the target (T -
> the thing we are looking for) and get the filter for the candidate (C - the
> bucket) and evaluate T&C = T
> if it evaluates as true there is a match if it not then T is guaranteed not
> to be in the bucket.
>
> There are several of us at Apache that work on bloom filters and we have
> been unable to locate an open source library that is under the Apache or
> similar license.  I have done work on a concept call a proto-bloom filter
> that does the hashing early and then makes it faster to generated concrete
> bloom filters of various sizes, thus enabling a more efficient layering of
> filters.
>
> Several of us have done research into ways to index filters so that if you
> have a collection of filters you can quickly locate the candidates.  This
> is not as simple as it sounds due to the way in which filters are checked
> and the issues with over filled filters yielding high false positive
> counts, in addition the check is so fast that the over head for most
> indexing eats up any increase in speed. My research has shown that for
> filter collections of less than 1000 it is always faster to do a linear
> search through an array than any other means.  Above 1000 entries there are
> techniques that can yield faster evaluation in some cases.
>
> The long and short of this is that there is no good unencumbered open
> source library available at the current time.  Myself and several others,
> in conversation here at ApacheCon, have expressed interest in creating such
> a library.  We have fairly mature code that we are willing to contribute
> along with code that embodies new thinking in the bloom filter arena (like
> proto-bloom filters).  We just need a space within the Apache family to
> host it.  The code base seems to small to be a separate project and so we
> come to Apache Commons seeking a home.
>
> Claude
>

Hi Claude,

Thank you for the explainer :-) quite helpful.

Gary


>
>
>
>
> [1] https://hur.st/bloomfilter/
>
> On Wed, Sep 11, 2019 at 12:36 PM Gary Gregory <garydgregory@gmail.com>
> wrote:
>
> > I would like to know more. I am curious since looking up whether an
> element
> > is in a set is done via a hash code. How do you do better than that?
> >
> > Gary
> >
> > On Tue, Sep 10, 2019, 16:51 Bruno P. Kinoshita <kinow@apache.org> wrote:
> >
> > >  +1 Collections sounds like a good place for a bloom filter.
> > >
> > > Bruno
> > >
> > >     On Wednesday, 11 September 2019, 8:00:45 am NZST, Jochen Wiedmann <
> > > jochen.wiedmann@gmail.com> wrote:
> > >
> > >  Hi, Claude,
> > >
> > > having read, what a bloom filter is, a subproject sounds unnecessary
> > > to me. I'd recommend, that you contribute your code to Commons
> > > Collections, which seems to me to be a logical target.
> > >
> > > Jochen
> > >
> > > On Tue, Sep 10, 2019 at 8:45 PM Claude Warren <claude@xenei.com>
> wrote:
> > > >
> > > > Having spoken with several people at ApacheCon, I would like to see a
> > > > bloomfilter sub project.  I have code that is already under Apache
> > > License
> > > > that I am willing to contribute as the basis The goal of the
> > sub-project
> > > > would be to produce a reference implementation that could be used by
> > > other
> > > > projects that desire to have use bloom filters and bloom filter based
> > > > collections.
> > > >
> > > > Is there any objection to doing this?  Other than asking here, what
> is
> > > the
> > > > proper path to get a sub-project created,  What does the Commons PMC
> > > > require?
> > > >
> > > > Any assistance and comments would be apprecieated.
> > > > Claude
> > > >
> > > > --
> > > > I like: Like Like - The likeliest place on the web
> > > > <http://like-like.xenei.com>
> > > > LinkedIn: http://www.linkedin.com/in/claudewarren
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > > For additional commands, e-mail: dev-help@commons.apache.org
> > >
> > >
> >
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message