commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: [collections] BloomFilter or BitSet functions?
Date Sun, 13 Oct 2019 19:49:15 GMT
I did some work on the BloomFiltersFunction idea and found that moving the
hamming distance and approximate log methods out of the BloomFilter
interface made implementation much easier and cleaner.  In fact, all the
calls in my code that is utilizing the proposed contribution got simpler.
I now firmly believe that this is the proper way forward.  The change is on
my commons-collection fork under a "funcs" branch.

My plan is to complete the documentation for BloomFilterFunctions, verify
that I have not added any issues to any of the build reports and then merge
it back into my contribution WIP.

If anyone objects or thinks that this change is not proper please let me
know soonest.

Thanks,
Claude

On Sun, Oct 13, 2019 at 7:07 AM Claude Warren <claude@xenei.com> wrote:

> I believe the functions should be in a separate class as it increases the
> separation of concerns.
>
> The methods are used in Bloom Filter manipulation and analysis but are not
> specific to bloom filters.  They are in fact specific to bit vectors.
>
> My original thought was to create a function class (Name to be determined
> but lets call it Func for now) that would have static methods like:
>
> int Func.hammingDistance( BloomFilter one, BloomFilter two)
>
> I think this makes sense as it emphasizes the fact that we are dealing
> with BloomFilters in this package.  However, the implementation is probably
> going to be along the lines of
>
> BitSet bsOne = one.getBitSet();
> BitSet bsTwo = two.getBitSet();
> // manipulate the bitsets here
>
> so the question was should there be a BitSet contribution that would be
>
> int Func.hammingDistance( BitSet one, BitSet two)
>
> The more I think about it the more I think that the BitSet functionality
> should wait until someone wants it.  Right now I can proceed without it and
> provide the semantically sound methods for the BloomFilters.
>
> Claude
>
> On Fri, Oct 11, 2019 at 12:36 AM Gilles Sadowski <gilleseran@gmail.com>
> wrote:
>
>> Hello.
>>
>> Le lun. 7 oct. 2019 à 19:42, Claude Warren <claude@xenei.com> a écrit :
>> >
>> > As noted earlier I am preparing a contribution of Bloom Filter classes
>> to
>> > the collections module.  As part of this submission there are several
>> > methods that operate on BitSets that are used as part  of Bloom Filter
>> > manipulation and analysis.  My question is, should these be contributed
>> as
>> > Bloom Filter specific methods or would it be better to submit a BitSet
>> > function library.
>>
>> What do you mean?
>> What would be the alternative?  How would usage change (from a
>> user perspective)?  Would it improve the design (e.g. be increasing
>> the "separation of concerns")?
>>
>> Thanks,
>> Gilles
>>
>> >
>> > The methods in question are:
>> > hammingDistance() = the cardinality (A xor B)
>> > jaccardDistance()  = the 1 - jaccardSimilarity()
>> > jaccardSimilarity() = cardinality(A xor B) / cardinality (A or B)
>> > cosineDistance() = 1 - cosineSimilarity()
>> > cosineSimilarity() = cardinality( A and B ) / (Sqrt( cardinality( A ) )
>> *
>> > Sqrt( cardinality( B )))
>> > estimatedLog = estimated log2 of the BitSet if considered a large
>> unsigned
>> > int.
>> >
>> > Opinions requested.
>> >
>> > Claude
>> > --
>> > I like: Like Like - The likeliest place on the web
>> > <http://like-like.xenei.com>
>> > LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message