spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Debasish Ghosh <ghosh.debas...@gmail.com>
Subject Re: using StreamingKMeans
Date Sat, 19 Nov 2016 19:24:44 GMT
Thanks a lot for the response.

Regarding the sampling part - yeah that's what I need to do if there's no
way of titrating the number of clusters online.

I am using something like

dstream.foreachRDD { rdd =>
  if (rdd.count() > 0) { //.. logic
  }
}

Feels a little odd but if that's the idiom then I will stick to it.

regards.



On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <cody@koeninger.org> wrote:

> So I haven't played around with streaming k means at all, but given
> that no one responded to your message a couple of days ago, I'll say
> what I can.
>
> 1. Can you not sample out some % of the stream for training?
> 2. Can you run multiple streams at the same time with different values
> for k and compare their performance?
> 3. foreachRDD is fine in general, can't speak to the specifics.
> 4. If you haven't done any transformations yet on a direct stream,
> foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
> is very cheap, it's done on the driver only because the beginning and
> ending offsets are known.  So you should be able to skip empty
> batches.
>
>
>
> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <ghosh.debasish@gmail.com>
> wrote:
> > Hello -
> >
> > I am trying to implement an outlier detection application on streaming
> data.
> > I am a newbie to Spark and hence would like some advice on the confusions
> > that I have ..
> >
> > I am thinking of using StreamingKMeans - is this a good choice ? I have
> one
> > stream of data and I need an online algorithm. But here are some
> questions
> > that immediately come to my mind ..
> >
> > 1. I cannot do separate training, cross validation etc. Is this a good
> idea
> > to do training and prediction online ?
> >
> > 2. The data will be read from the stream coming from Kafka in
> microbatches
> > of (say) 3 seconds. I get a DStream on which I train and get the
> clusters.
> > How can I decide on the number of clusters ? Using StreamingKMeans is
> there
> > any way I can iterate on microbatches with different values of k to find
> the
> > optimal one ?
> >
> > 3. Even if I fix k, after training on every microbatch I get a DStream.
> How
> > can I compute things like clustering score on the DStream ?
> > StreamingKMeansModel has a computeCost function but it takes an RDD. I
> can
> > use dstream.foreachRDD { // process RDD for the micro batch here } - is
> this
> > the idiomatic way ?
> >
> > 4. If I use dstream.foreachRDD { .. } and use functions like new
> > StandardScaler().fit(rdd) to do feature normalization, then it works
> when I
> > have data in the stream. But when the microbatch is empty (say I don't
> have
> > data for some time), the fit method throws exception as it gets an empty
> > collection. Things start working ok when data starts coming back to the
> > stream. But is this the way to go ?
> >
> > any suggestion will be welcome ..
> >
> > regards.
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>



-- 
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh

Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg

Mime
View raw message