spark-user mailing list archives

From Debasish Ghosh <>
Subject outlier detection using StreamingKMeans
Date Thu, 17 Nov 2016 14:03:10 GMT
Hello -

I am trying to implement an outlier detection application on streaming
data. I am a newbie to Spark and would appreciate some advice on a few
points of confusion.

I am thinking of using StreamingKMeans - is this a good choice? I have one
stream of data and I need an online algorithm. Here are the questions that
immediately come to mind:

   1. I cannot do separate training, cross-validation, etc. Is it a good
   idea to do both training and prediction online?
   2. The data will be read from a Kafka stream in microbatches of (say) 3
   seconds. I get a DStream on which I train the model and get the
   clusters. How can I decide on the number of clusters? With
   StreamingKMeans, is there any way to iterate over the microbatches with
   different values of k to find the optimal one?
   3. Even if I fix k, after training on every microbatch I get a DStream.
   How can I compute something like a clustering score on the DStream?
   StreamingKMeansModel has a computeCost function, but it takes an RDD.
   Maybe DStream.foreachRDD { ... } can work, but I am not able to figure
   out how. How can we compute the cost of clustering for an unbounded
   stream of data? Is there an idiomatic way to handle this?
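For what it's worth, questions 2 and 3 both come down to computing a clustering cost per microbatch - the within-set sum of squared errors (WSSSE) that computeCost returns - which one could in principle do inside foreachRDD. Below is a plain-Python sketch of the idea only, not Spark code: the tiny Lloyd's k-means and the generated batch are stand-ins for what one microbatch fit would produce, and the "elbow" scan over k is one common heuristic for picking the cluster count.

```python
import random

def d2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wssse(points, centers):
    """Within-set sum of squared errors: the sum over all points of the
    squared distance to the nearest center (what computeCost measures)."""
    return sum(min(d2(p, c) for c in centers) for p in points)

def kmeans(points, k, iters=20):
    """A tiny Lloyd's k-means with deterministic farthest-point
    initialization, standing in for the per-microbatch model update."""
    centers = [points[0]]
    while len(centers) < k:
        # next center: the point farthest from all chosen centers
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))
    for _ in range(iters):
        # assign each point to its nearest center
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: d2(p, centers[j]))
            buckets[i].append(p)
        # recompute each center as the mean of its bucket
        # (keep the old center if a bucket empties)
        centers = [
            tuple(sum(col) / len(b) for col in zip(*b)) if b else centers[i]
            for i, b in enumerate(buckets)
        ]
    return centers

# Two well-separated blobs, standing in for one 3-second microbatch.
rng = random.Random(42)
batch = ([(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(100)]
         + [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(100)])

# "Elbow" scan: the cost drops sharply until k reaches the real cluster
# count, then flattens - one heuristic for choosing k on each batch.
costs = {k: wssse(batch, kmeans(batch, k)) for k in (1, 2, 3, 4)}
```

For the unbounded-stream part of question 3, one natural option is to keep an exponentially weighted running average of this per-batch cost across microbatches rather than a total, mirroring the decay-factor idea StreamingKMeans itself uses for the centers.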

Or is StreamingKMeans not the right choice for anomaly detection in an
online setting?
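On that last point, one common pattern (an assumption on my part, not something StreamingKMeans gives you directly) is to treat each point's distance to its nearest cluster center as an anomaly score, and flag points whose score sits far above the mean. A plain-Python sketch of that scoring step follows; in a real Spark job the centers would come from the trained model's clusterCenters, and the threshold statistics would be maintained across microbatches rather than recomputed per batch as done here for simplicity.

```python
import math

def nearest_dist(point, centers):
    """Distance from a point to its nearest cluster center."""
    return min(math.dist(point, c) for c in centers)

def flag_outliers(batch, centers, n_sigma=3.0):
    """Score each point by its distance to the nearest center and flag
    points more than n_sigma standard deviations above the mean score.
    Per-batch statistics for simplicity; a streaming job would keep a
    running mean/std across microbatches instead."""
    scores = [nearest_dist(p, centers) for p in batch]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    cutoff = mean + n_sigma * math.sqrt(var)
    return [p for p, s in zip(batch, scores) if s > cutoff]

# Hypothetical centers, as if produced by a trained k-means model.
centers = [(0.0, 0.0), (5.0, 5.0)]

# Twenty points clustered tightly around the two centers, plus one
# point far from both - the kind of point we want to flag.
offsets = [(-0.2, 0.1), (0.0, 0.3), (0.2, -0.1), (-0.1, -0.2), (0.3, 0.0),
           (0.1, 0.2), (-0.3, 0.0), (0.0, -0.3), (0.2, 0.2), (-0.1, 0.1)]
batch = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]
batch.append((9.0, -9.0))

outliers = flag_outliers(batch, centers)  # flags only (9.0, -9.0)
```

The n_sigma threshold is a tunable trade-off between false positives and missed anomalies; with very small batches the outlier itself inflates the standard deviation, which is another reason to maintain the statistics across batches in a streaming setting.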

Any suggestions are welcome.


Debasish Ghosh

Twttr: @debasishg
