commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sebb <seb...@gmail.com>
Subject Re: New Sub-project Proposal.
Date Mon, 16 Sep 2019 00:16:43 GMT
On Mon, 16 Sep 2019 at 00:17, Gilles Sadowski <gilleseran@gmail.com> wrote:
>
> Hi.
>
> Le sam. 14 sept. 2019 à 08:15, Claude Warren <claude@xenei.com> a écrit :
> >
> > @Gilles
> >
> > I am happy to rename the package without the plural if that is the
> > standard, I will also fix the indent issue.  Is there a definition that can
> > be quickly imported into Eclipse to do the proper formatting?
>
> Hopefully, someone else can answer that one.
>
> > I am adding/updating all comments in the code.
> >
> > FilterConfig contains a main method to make it easy to assess how the
> > various settings influence the length and number of functions in the
> > filter.  This could go into another class or a Utility class but neither of
> > those seem like a better solution.
>
> I don't think there should be a "main" method in a library.
> Also, just printing to stdout does not seem very flexible.
> Perhaps define a "toString()" method?
>
> > I agree with your assessment of final and immutable for methods, variables
> > and classes.
> >
> > I am removing the FilterConfig constructor for BloomFilter and ensuring the
> > BloomFilter makes a defensive copy.
> >
> > The difference between match() and equals() is that equals() is a bit
> > vector equality and match is the standard bloom filter comparison T&C = T.
> > Documentation is being added to the match() and inverseMatch() to explain
> > this.
> >
> > STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).  It
> > identifies how many bytes the FilterConfig will use when written to a
> > DataOuptut or read from a DataInput.
>
> Why is it necessary to know such internal details?
> Anyways, I think that persistence is best left to higher-level code.
>
> >
> > I think the Utils class can be removed once the other code is moved to
> > Java8 streams and not extended iterators.
> >
> > Hash function is changed to Objects.hash()
> >
> > Constructors are/will be private where factories make sense.  For some the
> > simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )
>
> See
>     https://www.baeldung.com/java-constructors-vs-static-factory-methods
> and
>     https://blog.joda.org/2014/03/valjos-value-java-objects.html
>
> > I have no problem removing the Serializable on FilterConfig, but can you
> > explain (or point to a post) why it should be avoided?
>
> Same as above (not prejudging how application want to handle serialization).

It's difficult to code Serializable classes properly, and just as hard
to maintain them.
Also the serialised format becomes part of the API making changes harder.

We should not promise what we likely cannot continue to deliver,
especially since it may not be needed.

Serializable classes are also a target for code exploits.

> > The update/build methods are being documented and an explanation of the
> > difference included.
>
> Am I wrong that the "build(...)" methods only spare the typing of 2
> characters?
> If so, it's not worth the increase of the API size.
>
> >
> > getLog() is also being documented
> >
> > The asBitSet() method was to extract the bitset from the bloom filter so
> > that it could be stored otherwise used with BitSet based processes.  The
> > name of the method is being chagned to getBitSet() and it makes a defensive
> > copy.
> >
> > The ProtoBloomFilter is not a factory.  It is really a bean and there are
> > times when it makes sense to use the ProtoBloomFilter.  The name is
> > important as it identifies what it is.  It contains the necessary
> > information to build any configuration of BloomFilter.  There is a proposal
> > to use ProtoBloomFilters in service calls.  It could be defined as an
> > interface with a concrete implementation in the ProtoBloomFilterBuilder.
> >
> > As an example of ProtoBloomFilter usage.  Say you have a Person object that
> > has the user's first name, last name, eye color, hair color.
> >
> > The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
> > update() followed by a build() to build the protoBloomFilter.
> > ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update( "Claude"
> > ).update( "Warren" ).update( "brown" ).update( "silver" ).build();
>
> I'm still missing why it cannot be a factory, but maybe it is a
> question of design habits.  What about the following:
>
> public class BloomFilter {
>     private final BitSet bitSet;
>
>     private BloomFilter(BitSet bs) {
>         bitSet = (BitSet) (bs.clone());
>     }
>
>     /** Factory method. (cf. "ValJO") */
>     public BloomFilter of(BitSet bs) {
>         return new BloomFilter(bs);
>     }
>
>     // Combined "ProtoBloomFilter" and "ProtoBloomFilterBuilder" logic.
>     public static class Builder {
>          private final Set<Hash> hashes = new HashSet<>();
>
>          private Builder() {}
>
>          /** Factory method(s). */
>          public Builder with(Builder other) {
>               hashes.addAll(other.hashes);
>          }
>          public Builder with(ByteBuffer b) { /* ... */ }
>          // etc.
>
>         /** Factory (converter). */
>         public BloomFilter create(FilterConfig cfg) {
>              final BitSet set = new BitSet(cfg.getNumberOfBits());
>              for (Hash hash : hashes) {
>                   hash.populate(set, cfg);
>              }
>              return BloomFilter.of(set);
>         }
>     }
> }
>
> Perhaps I'm taking too many shortcuts... But, as pointed out by
> Gary, more coverage is required; and unit tests will help figure out
> the leanest design.
>
> >
> > I can create a bloom filter configuration that has 1x10^7  items and
> > accepts 1 / 1x10^6 collisions.
> > FilterConfig config1 = new FilterConfig( 10000000, 1000000)
> >
> > and build the BloomFilter
> > BloomFilter filter1 = proto.create( config1 )
> >
> > I can use that filter to determine if the person is in a collection of
> > 1x10^7 items.  The implementation of that collection could be multiple
> > buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To check
> > that I create an appropriate FilterConfig
> > FilterConfig config2 = new FilterConfig( 10000, 5000 )
> >
> > and then create the proper bloom filter
> > BloomFilter filter2 = proto.create( config2 )
> >
>
> With my above attempt at an alternative API, we'd have (untested):
>
> BloomFilter.Builder proto = BloomFilter
>      .with("Claude")
>      .with("Warren")
>      .with("brown").
>      .with("silver");
>
> BloomFilter filter1 = proto.create(config1);
> BloomFilter filter2 = proto.create(config2);
>
>
> IIUC, in your current implementation of "ProtoBloomFilter", there
> is a side-effect of "build()" (i.e. the call to "clear()"); thus calling
> "build()" twice will not yield the same result.  I don't think that's
> a desirable feature.
>
> Regards,
> Gilles
>
> > Claude
> >
> >
> > On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <gilleseran@gmail.com>
> > wrote:
> >
> > > > > [...]
> > > >
> > > > Gilles,
> > > >
> > > > Take a look at the PR, I added comments to allow the PR to have 0 deps.
> > > >
> > > > Gary
> > > >
> > > >
> > >
> > > IMO, the package should be named "bloomfilter" (without "s").
> > > Naming seems inconsistent in [Collections]: Some (package)
> > > names are singular, other plural.
> > >
> > > * Indent must be 4 spaces.
> > > * All fields and methods must be commented.
> > > * "FilterConfig" contains a "main" method.
> > > * Local variables and fields must be declared "final" whenever constant.
> > > * Some methods are declared "final", others not.
> > > * "BloomFilter" should be made immutable.
> > > * "BloomFilter" could do without a second constructor: "FilterConfig"
> > > should have a method that returns a "BitSet".
> > > * "BloomFilter" must make a defensive copy of its "bitSet" argument.
> > > * What's the difference between "match" and "equals"?
> > > * Why is "STORED_SIZE" public?
> > > * "Utils" classes should be avoided.
> > > * Hash function should be imported/shaded from "Commons Codec"
> > > (and defined there if missing).
> > > * Constructor(s) should be private (instantiation through factory
> > > methods (cf. ValJO).
> > > * "Serializable" should be avoided.
> > > * The "update" methods are missing explanations (or reference) about
> > > their purpose.  Also, it seems that "update" and "build" are redundant.
> > > * Does "getLog" return a standard value?  If so, the reference should
> > > appear in the Javadoc (not just as code comment).
> > > * What is the purpose of method "asBitSet()"?
> > > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> > > "ProtoBloomFilterBuilder" should be a static inner class of that
> > > factory.
> > >
> > > Regards,
> > > Gilles
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message