commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claude Warren <cla...@xenei.com>
Subject Re: New Sub-project Proposal.
Date Mon, 16 Sep 2019 01:31:58 GMT
I am refactoring some of the code to make the Builder and Hash classes
enclosed in the ProtoBloomFilter class.  I have also add the proxy
serializer as noted in the effective java talk.  This simplifies the
classes significantly.  I will have unit tests and fix the javadocs before
the next push.

Claude

On Mon, Sep 16, 2019 at 2:01 AM Gary Gregory <garydgregory@gmail.com> wrote:

> On Sun, Sep 15, 2019 at 8:17 PM sebb <sebbaz@gmail.com> wrote:
>
> > On Mon, 16 Sep 2019 at 00:17, Gilles Sadowski <gilleseran@gmail.com>
> > wrote:
> > >
> > > Hi.
> > >
> > > Le sam. 14 sept. 2019 à 08:15, Claude Warren <claude@xenei.com> a
> écrit
> > :
> > > >
> > > > @Gilles
> > > >
> > > > I am happy to rename the package without the plural if that is the
> > > > standard, I will also fix the indent issue.  Is there a definition
> > that can
> > > > be quickly imported into Eclipse to do the proper formatting?
> > >
> > > Hopefully, someone else can answer that one.
> > >
> > > > I am adding/updating all comments in the code.
> > > >
> > > > FilterConfig contains a main method to make it easy to assess how the
> > > > various settings influence the length and number of functions in the
> > > > filter.  This could go into another class or a Utility class but
> > neither of
> > > > those seem like a better solution.
> > >
> > > I don't think there should be a "main" method in a library.
> > > Also, just printing to stdout does not seem very flexible.
> > > Perhaps define a "toString()" method?
> > >
> > > > I agree with your assessment of final and immutable for methods,
> > variables
> > > > and classes.
> > > >
> > > > I am removing the FilterConfig constructor for BloomFilter and
> > ensuring the
> > > > BloomFilter makes a defensive copy.
> > > >
> > > > The difference between match() and equals() is that equals() is a bit
> > > > vector equality and match is the standard bloom filter comparison T&C
> > = T.
> > > > Documentation is being added to the match() and inverseMatch() to
> > explain
> > > > this.
> > > >
> > > > STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).
> > It
> > > > identifies how many bytes the FilterConfig will use when written to a
> > > > DataOuptut or read from a DataInput.
> > >
> > > Why is it necessary to know such internal details?
> > > Anyways, I think that persistence is best left to higher-level code.
> > >
> > > >
> > > > I think the Utils class can be removed once the other code is moved
> to
> > > > Java8 streams and not extended iterators.
> > > >
> > > > Hash function is changed to Objects.hash()
> > > >
> > > > Constructors are/will be private where factories make sense.  For
> some
> > the
> > > > simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )
> > >
> > > See
> > >
> https://www.baeldung.com/java-constructors-vs-static-factory-methods
> > > and
> > >     https://blog.joda.org/2014/03/valjos-value-java-objects.html
> > >
> > > > I have no problem removing the Serializable on FilterConfig, but can
> > you
> > > > explain (or point to a post) why it should be avoided?
> > >
> > > Same as above (not prejudging how application want to handle
> > serialization).
> >
> > It's difficult to code Serializable classes properly, and just as hard
> > to maintain them.
> > Also the serialised format becomes part of the API making changes harder.
> >
> > We should not promise what we likely cannot continue to deliver,
> > especially since it may not be needed.
> >
> > Serializable classes are also a target for code exploits.
> >
>
> +1 to removing Serializable support in its current implementation. If
> serialization is a requirement, then it should be implemented using a
> Serialization Proxy a la J. Block:
>
> https://www.youtube.com/watch?v=V1vQf4qyMXg&feature=youtu.be&t=56m12s
>
> Gary
>
>
> > > > The update/build methods are being documented and an explanation of
> the
> > > > difference included.
> > >
> > > Am I wrong that the "build(...)" methods only spare the typing of 2
> > > characters?
> > > If so, it's not worth the increase of the API size.
> > >
> > > >
> > > > getLog() is also being documented
> > > >
> > > > The asBitSet() method was to extract the bitset from the bloom filter
> > so
> > > > that it could be stored otherwise used with BitSet based processes.
> > The
> > > > name of the method is being chagned to getBitSet() and it makes a
> > defensive
> > > > copy.
> > > >
> > > > The ProtoBloomFilter is not a factory.  It is really a bean and there
> > are
> > > > times when it makes sense to use the ProtoBloomFilter.  The name is
> > > > important as it identifies what it is.  It contains the necessary
> > > > information to build any configuration of BloomFilter.  There is a
> > proposal
> > > > to use ProtoBloomFilters in service calls.  It could be defined as an
> > > > interface with a concrete implementation in the
> > ProtoBloomFilterBuilder.
> > > >
> > > > As an example of ProtoBloomFilter usage.  Say you have a Person
> object
> > that
> > > > has the user's first name, last name, eye color, hair color.
> > > >
> > > > The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
> > > > update() followed by a build() to build the protoBloomFilter.
> > > > ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update(
> "Claude"
> > > > ).update( "Warren" ).update( "brown" ).update( "silver" ).build();
> > >
> > > I'm still missing why it cannot be a factory, but maybe it is a
> > > question of design habits.  What about the following:
> > >
> > > public class BloomFilter {
> > >     private final BitSet bitSet;
> > >
> > >     private BloomFilter(BitSet bs) {
> > >         bitSet = (BitSet) (bs.clone());
> > >     }
> > >
> > >     /** Factory method. (cf. "ValJO") */
> > >     public BloomFilter of(BitSet bs) {
> > >         return new BloomFilter(bs);
> > >     }
> > >
> > >     // Combined "ProtoBloomFilter" and "ProtoBloomFilterBuilder" logic.
> > >     public static class Builder {
> > >          private final Set<Hash> hashes = new HashSet<>();
> > >
> > >          private Builder() {}
> > >
> > >          /** Factory method(s). */
> > >          public Builder with(Builder other) {
> > >               hashes.addAll(other.hashes);
> > >          }
> > >          public Builder with(ByteBuffer b) { /* ... */ }
> > >          // etc.
> > >
> > >         /** Factory (converter). */
> > >         public BloomFilter create(FilterConfig cfg) {
> > >              final BitSet set = new BitSet(cfg.getNumberOfBits());
> > >              for (Hash hash : hashes) {
> > >                   hash.populate(set, cfg);
> > >              }
> > >              return BloomFilter.of(set);
> > >         }
> > >     }
> > > }
> > >
> > > Perhaps I'm taking too many shortcuts... But, as pointed out by
> > > Gary, more coverage is required; and unit tests will help figure out
> > > the leanest design.
> > >
> > > >
> > > > I can create a bloom filter configuration that has 1x10^7  items and
> > > > accepts 1 / 1x10^6 collisions.
> > > > FilterConfig config1 = new FilterConfig( 10000000, 1000000)
> > > >
> > > > and build the BloomFilter
> > > > BloomFilter filter1 = proto.create( config1 )
> > > >
> > > > I can use that filter to determine if the person is in a collection
> of
> > > > 1x10^7 items.  The implementation of that collection could be
> multiple
> > > > buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To
> > check
> > > > that I create an appropriate FilterConfig
> > > > FilterConfig config2 = new FilterConfig( 10000, 5000 )
> > > >
> > > > and then create the proper bloom filter
> > > > BloomFilter filter2 = proto.create( config2 )
> > > >
> > >
> > > With my above attempt at an alternative API, we'd have (untested):
> > >
> > > BloomFilter.Builder proto = BloomFilter
> > >      .with("Claude")
> > >      .with("Warren")
> > >      .with("brown").
> > >      .with("silver");
> > >
> > > BloomFilter filter1 = proto.create(config1);
> > > BloomFilter filter2 = proto.create(config2);
> > >
> > >
> > > IIUC, in your current implementation of "ProtoBloomFilter", there
> > > is a side-effect of "build()" (i.e. the call to "clear()"); thus
> calling
> > > "build()" twice will not yield the same result.  I don't think that's
> > > a desirable feature.
> > >
> > > Regards,
> > > Gilles
> > >
> > > > Claude
> > > >
> > > >
> > > > On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <
> gilleseran@gmail.com>
> > > > wrote:
> > > >
> > > > > > > [...]
> > > > > >
> > > > > > Gilles,
> > > > > >
> > > > > > Take a look at the PR, I added comments to allow the PR to have
0
> > deps.
> > > > > >
> > > > > > Gary
> > > > > >
> > > > > >
> > > > >
> > > > > IMO, the package should be named "bloomfilter" (without "s").
> > > > > Naming seems inconsistent in [Collections]: Some (package)
> > > > > names are singular, other plural.
> > > > >
> > > > > * Indent must be 4 spaces.
> > > > > * All fields and methods must be commented.
> > > > > * "FilterConfig" contains a "main" method.
> > > > > * Local variables and fields must be declared "final" whenever
> > constant.
> > > > > * Some methods are declared "final", others not.
> > > > > * "BloomFilter" should be made immutable.
> > > > > * "BloomFilter" could do without a second constructor:
> "FilterConfig"
> > > > > should have a method that returns a "BitSet".
> > > > > * "BloomFilter" must make a defensive copy of its "bitSet"
> argument.
> > > > > * What's the difference between "match" and "equals"?
> > > > > * Why is "STORED_SIZE" public?
> > > > > * "Utils" classes should be avoided.
> > > > > * Hash function should be imported/shaded from "Commons Codec"
> > > > > (and defined there if missing).
> > > > > * Constructor(s) should be private (instantiation through factory
> > > > > methods (cf. ValJO).
> > > > > * "Serializable" should be avoided.
> > > > > * The "update" methods are missing explanations (or reference)
> about
> > > > > their purpose.  Also, it seems that "update" and "build" are
> > redundant.
> > > > > * Does "getLog" return a standard value?  If so, the reference
> should
> > > > > appear in the Javadoc (not just as code comment).
> > > > > * What is the purpose of method "asBitSet()"?
> > > > > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> > > > > "ProtoBloomFilterBuilder" should be a static inner class of that
> > > > > factory.
> > > > >
> > > > > Regards,
> > > > > Gilles
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > > For additional commands, e-mail: dev-help@commons.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > For additional commands, e-mail: dev-help@commons.apache.org
> >
> >
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message