spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Debasish Ghosh <ghosh.debas...@gmail.com>
Subject Re: using StreamingKMeans
Date Sat, 19 Nov 2016 23:29:07 GMT
Looking for alternative suggestions in case where we have 1 continuous
stream of data. Offline training and online prediction can be one option if
we can have an alternate set of data to train. But if it's one single
stream you don't have separate sets for training or cross validation.

So whatever data u get in each micro batch, train on them and u get the
cluster centroids from the model. Then apply some heuristics like mean
distance from centroid and detect outliers. So for every microbatch u get
the outliers based on the model and u can control forgetfulness of the
model through the decay factor that u specify for StramingKMeans.

Suggestions ?

regards.

On Sun, 20 Nov 2016 at 3:51 AM, ayan guha <guha.ayan@gmail.com> wrote:

> Curious why do you want to train your models every 3 secs?
> On 20 Nov 2016 06:25, "Debasish Ghosh" <ghosh.debasish@gmail.com> wrote:
>
> Thanks a lot for the response.
>
> Regarding the sampling part - yeah that's what I need to do if there's no
> way of titrating the number of clusters online.
>
> I am using something like
>
> dstream.foreachRDD { rdd =>
>   if (rdd.count() > 0) { //.. logic
>   }
> }
>
> Feels a little odd but if that's the idiom then I will stick to it.
>
> regards.
>
>
>
> On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <cody@koeninger.org>
> wrote:
>
> So I haven't played around with streaming k means at all, but given
> that no one responded to your message a couple of days ago, I'll say
> what I can.
>
> 1. Can you not sample out some % of the stream for training?
> 2. Can you run multiple streams at the same time with different values
> for k and compare their performance?
> 3. foreachRDD is fine in general, can't speak to the specifics.
> 4. If you haven't done any transformations yet on a direct stream,
> foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
> is very cheap, it's done on the driver only because the beginning and
> ending offsets are known.  So you should be able to skip empty
> batches.
>
>
>
> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <ghosh.debasish@gmail.com>
> wrote:
> > Hello -
> >
> > I am trying to implement an outlier detection application on streaming
> data.
> > I am a newbie to Spark and hence would like some advice on the confusions
> > that I have ..
> >
> > I am thinking of using StreamingKMeans - is this a good choice ? I have
> one
> > stream of data and I need an online algorithm. But here are some
> questions
> > that immediately come to my mind ..
> >
> > 1. I cannot do separate training, cross validation etc. Is this a good
> idea
> > to do training and prediction online ?
> >
> > 2. The data will be read from the stream coming from Kafka in
> microbatches
> > of (say) 3 seconds. I get a DStream on which I train and get the
> clusters.
> > How can I decide on the number of clusters ? Using StreamingKMeans is
> there
> > any way I can iterate on microbatches with different values of k to find
> the
> > optimal one ?
> >
> > 3. Even if I fix k, after training on every microbatch I get a DStream.
> How
> > can I compute things like clustering score on the DStream ?
> > StreamingKMeansModel has a computeCost function but it takes an RDD. I
> can
> > use dstream.foreachRDD { // process RDD for the micro batch here } - is
> this
> > the idiomatic way ?
> >
> > 4. If I use dstream.foreachRDD { .. } and use functions like new
> > StandardScaler().fit(rdd) to do feature normalization, then it works
> when I
> > have data in the stream. But when the microbatch is empty (say I don't
> have
> > data for some time), the fit method throws exception as it gets an empty
> > collection. Things start working ok when data starts coming back to the
> > stream. But is this the way to go ?
> >
> > any suggestion will be welcome ..
> >
> > regards.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>
>
>
>
> --
> Debasish Ghosh
> http://manning.com/ghosh2
> http://manning.com/ghosh
>
> Twttr: @debasishg
> Blog: http://debasishg.blogspot.com
> Code: http://github.com/debasishg
>
> --
Sent from my iPhone

Mime
View raw message