commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles Sadowski <gillese...@gmail.com>
Subject Re: [collections] BloomFilter or BitSet functions?
Date Sun, 13 Oct 2019 23:27:48 GMT
Hello.

Le dim. 13 oct. 2019 à 21:49, Claude Warren <claude@xenei.com> a écrit :
>
> I did some work on the BloomFiltersFunction idea and found that moving the
> hamming distance and approximate log methods out of the BloomFilter
> interface made implementation much easier and cleaner.  In fact, all the
> calls in my code that is utilizing the proposed contribution got simpler.
> I now firmly believe that this is the proper way forward.  The change is on
> my commons-collection fork under a "funcs" branch.

I'm not so sure (sorry!).
It suspiciously looks like a utility class, and moreover, most of the
method actually require a "BitSet" which the "BloomFilter" interface
mandates.
The fact that it seems to make other codes cleaner is perhaps a
sign that something is wrong with the design.
IMHO, it starts with a question asked before:  Is there going to
be other implementations of the "BloomFilter" interface?  If so, how
would they differ from "StandardBloomFilter"?  And would they use
those utility functions?

Other questions:
* Why, in "BloomNestedCollection", is the "BucketFactory" (nested)
interface is implemented by classes nested in that same class?
* Why do "BloomCollection" and "BloomNestedCollection" provide
the "BloomFilterGated" API?  Isn't the latter an "implementation
detail"?  IOW, what is the usage pattern?

Perhaps those remarks reflect my lack of knowledge of the proposed
functionality, but the size of your contribution makes it difficult to tell
apart what needs to be "public" (from a user perspective) from what
should be "private" (or "protected") as pertaining to the details of the
implementation.  E.g. it looks to me that the functions currently in
"BloomFilterFunctions" should not be part of the public API (?).

Regards,
Gilles

> My plan is to complete the documentation for BloomFilterFunctions, verify
> that I have not added any issues to any of the build reports and then merge
> it back into my contribution WIP.
>
> If anyone objects or thinks that this change is not proper please let me
> know soonest.
>
> Thanks,
> Claude
>
> On Sun, Oct 13, 2019 at 7:07 AM Claude Warren <claude@xenei.com> wrote:
>
> > I believe the functions should be in a separate class as it increases the
> > separation of concerns.
> >
> > The methods are used in Bloom Filter manipulation and analysis but are not
> > specific to bloom filters.  They are in fact specific to bit vectors.
> >
> > My original thought was to create a function class (Name to be determined
> > but lets call it Func for now) that would have static methods like:
> >
> > int Func.hammingDistance( BloomFilter one, BloomFilter two)
> >
> > I think this makes sense as it emphasizes the fact that we are dealing
> > with BloomFilters in this package.  However, the implementation is probably
> > going to be along the lines of
> >
> > BitSet bsOne = one.getBitSet();
> > BitSet bsTwo = two.getBitSet();
> > // manipulate the bitsets here
> >
> > so the question was should there be a BitSet contribution that would be
> >
> > int Func.hammingDistance( BitSet one, BitSet two)
> >
> > The more I think about it the more I think that the BitSet functionality
> > should wait until someone wants it.  Right now I can proceed without it and
> > provide the semantically sound methods for the BloomFilters.
> >
> > Claude
> >
> > On Fri, Oct 11, 2019 at 12:36 AM Gilles Sadowski <gilleseran@gmail.com>
> > wrote:
> >
> >> Hello.
> >>
> >> Le lun. 7 oct. 2019 à 19:42, Claude Warren <claude@xenei.com> a écrit
:
> >> >
> >> > As noted earlier I am preparing a contribution of Bloom Filter classes
> >> to
> >> > the collections module.  As part of this submission there are several
> >> > methods that operate on BitSets that are used as part  of Bloom Filter
> >> > manipulation and analysis.  My question is, should these be contributed
> >> as
> >> > Bloom Filter specific methods or would it be better to submit a BitSet
> >> > function library.
> >>
> >> What do you mean?
> >> What would be the alternative?  How would usage change (from a
> >> user perspective)?  Would it improve the design (e.g. be increasing
> >> the "separation of concerns")?
> >>
> >> Thanks,
> >> Gilles
> >>
> >> >
> >> > The methods in question are:
> >> > hammingDistance() = the cardinality (A xor B)
> >> > jaccardDistance()  = the 1 - jaccardSimilarity()
> >> > jaccardSimilarity() = cardinality(A xor B) / cardinality (A or B)
> >> > cosineDistance() = 1 - cosineSimilarity()
> >> > cosineSimilarity() = cardinality( A and B ) / (Sqrt( cardinality( A ) )
> >> *
> >> > Sqrt( cardinality( B )))
> >> > estimatedLog = estimated log2 of the BitSet if considered a large
> >> unsigned
> >> > int.
> >> >
> >> > Opinions requested.
> >> >
> >> > Claude
> >> > --
> >> > I like: Like Like - The likeliest place on the web
> >> > <http://like-like.xenei.com>
> >> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >> For additional commands, e-mail: dev-help@commons.apache.org
> >>
> >>
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message