samoa-dev mailing list archives

From Gianmarco De Francisci Morales <g...@apache.org>
Subject Re: Scalability of Vertical Hoeffding Tree
Date Sun, 14 May 2017 08:45:16 GMT
It is likely that the fixed 1-second sleep is hindering scalability.
There is a parameter in the entrance processor that controls the ingestion
rate; we should be using that one instead.

The reason that delay is there is that Storm and the other engines don't
have a way to specify priorities among streams.
So, if you read all the events from the source as fast as possible and send
them through the model, sometimes the model will not have time to receive
messages from the statistics processors to update the tree, and basically
the model does not learn anything.
A clean solution to this problem would involve priorities and
back-pressure, but the engines are not there yet, so we need to work around
it.

Try lowering the delay between events, but expect some drop in accuracy as
you do so.
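To make the trade-off concrete, here is a minimal, self-contained sketch (not the actual SAMOA classes; the queue and field names are illustrative) of replacing the fixed 1-second sleep with a short, configurable idle backoff in a nextTuple-style loop:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch only: stands in for StormEntranceProcessingItem's spout logic.
// The idea is that an idle spout backs off for a few milliseconds
// instead of a full second, so throughput is not capped by the sleep.
public class BackoffSpoutSketch {
    // Hypothetical stand-in for entranceProcessor.hasNext()/nextEvent()
    final Queue<String> events = new ArrayDeque<>();
    final long idleBackoffMs;   // e.g. 10 ms instead of 1000 ms
    int emitted = 0;

    BackoffSpoutSketch(long idleBackoffMs) {
        this.idleBackoffMs = idleBackoffMs;
    }

    void offer(String event) {
        events.add(event);
    }

    // Analogue of nextTuple(): emit if an event is ready, otherwise
    // back off briefly rather than sleeping for a whole second.
    void nextTuple() {
        String event = events.poll();
        if (event != null) {
            emitted++;                       // stands in for collector.emit(...)
        } else {
            try {
                Thread.sleep(idleBackoffMs); // short, tunable idle backoff
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        BackoffSpoutSketch spout = new BackoffSpoutSketch(10);
        spout.offer("a");
        spout.offer("b");
        long start = System.nanoTime();
        // 2 emits, then 3 short idle backoffs (~30 ms total, not 3 s)
        for (int i = 0; i < 5; i++) {
            spout.nextTuple();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("emitted=" + spout.emitted
                + " underOneSecond=" + (elapsedMs < 1000));
    }
}
```

With a 10 ms backoff an idle spout wakes up two orders of magnitude more often than with Utils.sleep(1000), which is the kind of ingestion-rate tuning the entrance-processor parameter is meant to expose.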

Cheers,

-- Gianmarco

On Mon, May 8, 2017 at 7:27 PM, Shigeru Imai <imais@rpi.edu> wrote:

> Hi Nicolas,
>
> I tested a dataset with 1000 numerical and 1000 nominal attributes, and
> again, the throughput did not scale.
>
> From your experience, would you be able to give me an idea of how the VHT
> should scale?
> For example, when the number of attributes is X and the parallelism is
> changed from A to B, the speedup of throughput is Y.
>
> Thank you,
> Shigeru
>
>
> On 5/5/2017 5:10 PM, Shigeru Imai wrote:
> > Hi Nicolas,
> >
> > Yes, I have been using those two parameters to scale up the computation.
> > I will try thousands of attributes as you suggested.
> > The bottleneck could be the Kafka connector, but let's see how it goes...
> >
> > Thank you for your help.
> >
> > Shigeru
> >
> > On 5/5/2017 4:11 AM, Nicolas Kourtellis wrote:
> >> Hi Shigeru,
> >>
> >> I believe you can adjust the parallelism you are asking about by
> >> modifying the -p parameter in the VHT algorithm, e.g.:
> >> -l (classifiers.trees.VerticalHoeffdingTree -p 2)
> >> will run with 2 parallel statistics processors.
> >>
> >> There is another option, in the file bin/samoa-storm.properties, with
> >> which you declare that you are running Storm in cluster mode and define
> >> the number of worker processes allocated to the cluster:
> >> samoa.storm.numworker=2
> >>
> >> You should adjust that one as well. In my setup I found that the two
> >> parameters (i.e., the -p and the samoa.storm.numworker) needed to be
> >> aligned, but I don't know if yours is different.
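As a concrete illustration of aligning the two settings (the launcher invocation and jar name below are assumptions; adjust them to your checkout and SAMOA version):

```shell
# In bin/samoa-storm.properties, set the number of Storm workers:
#   samoa.storm.numworker=2
#
# Then launch with a matching parallelism for the VHT statistics (-p 2):
bin/samoa storm target/SAMOA-Storm-*.jar \
  "PrequentialEvaluation -l (classifiers.trees.VerticalHoeffdingTree -p 2)"
```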
> >>
> >> Hope this helps,
> >>
> >> Nicolas
> >>
> >> On Tue, May 2, 2017 at 7:45 PM, Shigeru Imai <imais@rpi.edu> wrote:
> >>
> >>> Hi Nicolas,
> >>>
> >>> Thank you for your reply.
> >>>
> >>> I will wait for SAMOA-65 to be available.
> >>>
> >>> I tried a dataset with 100 numerical and 100 nominal attributes
> >>> generated with RandomTreeGenerator, but that did not scale either.
> >>> Again, the throughput remained at 50 Mbytes/sec up to 32 VMs.
> >>>
> >>> By the way, does the following scaling policy look good to you? Can I
> >>> assume that changing the parallelism of LocalStatisticsProcessor is the
> >>> only way to scale VHT? Or is there any other processor whose
> >>> parallelism I should change?
> >>>> * Scaling policy: assign one core per LocalStatisticsProcessor
> >>> Regards,
> >>> Shigeru
> >>>
> >>>
> >>> On 5/2/2017 10:25 AM, Nicolas Kourtellis wrote:
> >>>> Hi Shigeru,
> >>>>
> >>>> Thank you for the interest in the VHT algorithm and SAMOA. A couple
> >>>> of brief comments at first glance:
> >>>>
> >>>> - The particular connector with Kafka was not thoroughly tested, and
> >>>> that is why it has not been merged into the main branch yet.
> >>>> Some teams we are aware of are currently working on a proposed new
> >>>> connector, as you can see from this new open issue:
> >>>> https://issues.apache.org/jira/browse/SAMOA-65
> >>>>
> >>>> - Indeed, when we tested VHT with a small set of attributes, the
> >>>> benefit of more resources was not obvious, especially in throughput.
> >>>> Only when we scaled the problem out to thousands of attributes did
> >>>> adding more resources make sense.
> >>>>
> >>>> Hope this helps,
> >>>>
> >>>> Nicolas
> >>>>
> >>>>
> >>>>
> >>>> On Mon, May 1, 2017 at 10:35 PM, Shigeru Imai <imais@rpi.edu> wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> I am testing the scalability of Vertical Hoeffding Tree on
> >>>>> SAMOA-Storm consuming streams from Kafka. So far, I have tested up to
> >>>>> 32 VMs of m4.large on Amazon EC2; however, throughput hardly improves
> >>>>> at all. Storm consumes streams at 30 Mbytes/sec from Kafka with 1 VM,
> >>>>> and this throughput stays almost the same up to 32 VMs.
> >>>>>
> >>>>> Here are the experimental settings:
> >>>>> * SAMOA: latest on github as of April 2017
> >>>>> * Storm: version 0.10.1
> >>>>> * Dataset: forest covertype (54 attributes,
> >>>>> https://archive.ics.uci.edu/ml/datasets/Covertype)
> >>>>> * Kafka connector: implementation proposed for SAMOA-40
> >>>>> (https://github.com/apache/incubator-samoa/pull/32)
> >>>>> * Scaling policy: assign one core per LocalStatisticsProcessor
> >>>>> * Tested with Prequential Evaluation
> >>>>>
> >>>>> I read the Vertical Hoeffding Tree paper from IEEE BigData 2016, but
> >>>>> I could not find information on how the throughput of VHT scales when
> >>>>> we add more resources (it only shows relative performance
> >>>>> improvements compared to the standard Hoeffding tree).
> >>>>>
> >>>>> Has anyone scaled VHT successfully, with or without Kafka? Are there
> >>>>> any tips to achieve high throughput with VHT?
> >>>>> I believe using datasets with more attributes leads to better
> >>>>> scalability for VHT, so I am thinking of trying that next, but I
> >>>>> would expect 54 attributes to scale at least a little.
> >>>>>
> >>>>> Also, I found the following sleep of 1 second in
> >>>>> StormEntranceProcessingItem.java. It looks to me like this hinders
> >>>>> high-throughput processing. Can we get rid of this sleep?
> >>>>>     public void nextTuple() {
> >>>>>       if (entranceProcessor.hasNext()) {
> >>>>>         Values value = newValues(entranceProcessor.nextEvent());
> >>>>>         collector.emit(outputStream.getOutputId(), value);
> >>>>>       } else
> >>>>>         Utils.sleep(1000);
> >>>>>       // StormTupleInfo tupleInfo = tupleInfoQueue.poll(50,
> >>>>>       // TimeUnit.MILLISECONDS);
> >>>>>       // if (tupleInfo != null) {
> >>>>>       // Values value = newValues(tupleInfo.getContentEvent());
> >>>>>       // collector.emit(tupleInfo.getStormStream().getOutputId(), value);
> >>>>>       // }
> >>>>>     }
> >>>>>
> >>>>> Any suggestions would be appreciated.
> >>>>>
> >>>>> Thank you,
> >>>>> Shigeru
> >>>>>
> >>>>> --
> >>>>> Shigeru Imai  <imais@rpi.edu>
> >>>>> Ph.D. candidate
> >>>>> Worldwide Computing Laboratory
> >>>>> Department of Computer Science
> >>>>> Rensselaer Polytechnic Institute
> >>>>> 110 8th Street, Troy, NY 12180, USA
> >>>>> http://wcl.cs.rpi.edu/
> >>>>>
> >>>
> >
>
>
>
