spark-user mailing list archives

From Mayur Rustagi <>
Subject Re: Spark Streaming for time consuming job
Date Wed, 01 Oct 2014 11:59:02 GMT
Calling collect on anything is almost always a bad idea. The only
exception is if you are looking to pass that data on to some other system &
never see it again :) .
I would say you need to implement the outlier detection on the RDD & process
it in Spark itself rather than calling collect on it.
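For reference, here is a minimal sketch of the box-plot (IQR) outlier rule mentioned below, written as plain Python. The helper names are illustrative, not part of any Spark API; in an actual job this logic would run inside a transformation (e.g. `rdd.filter` or `mapPartitions` from `foreachRDD`) so the data never has to be collected to the driver:

```python
def quartiles(values):
    """Return (Q1, Q3) of a non-empty list, using linear interpolation."""
    xs = sorted(values)

    def percentile(p):
        k = (len(xs) - 1) * p
        lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

    return percentile(0.25), percentile(0.75)


def box_plot_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Example: one window of temperature readings with a single spike.
temps = [20.1, 20.3, 19.8, 20.0, 20.2, 35.0, 20.1]
print(box_plot_outliers(temps))  # [35.0]
```

Once the fence values are computed (say, from the window's own quartiles), the filtering step is a plain predicate, so it can run distributed on the executors instead of on collected data in driver-side threads.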


Mayur Rustagi
Ph: +1 (760) 203 3257
@mayur_rustagi <>

On Tue, Sep 30, 2014 at 3:22 PM, Eko Susilo <> wrote:

> Hi All,
> I have a problem that I would like to consult about regarding Spark
> Streaming.
> I have a Spark Streaming application that parses a file (which keeps
> growing as time passes). This file contains several columns of numbers,
> and the parsing is divided into windows (each 1 minute). Each column
> represents a different entity, while each row within a column represents
> a value of that entity (for example, the first column represents
> temperature, the second column represents humidity, etc., while each row
> represents the value of each attribute). I use a PairDStream for each
> column.
> Afterwards, I need to run a time-consuming algorithm (outlier detection;
> for now I use the box plot algorithm) on each RDD of each PairDStream.
> To run the outlier detection, I am currently thinking of calling collect
> on each of the PairDStreams from the forEachRDD method, getting the list
> of items, and then passing each list of items to a thread. Each thread
> runs the outlier detection algorithm and processes the result.
> I run the outlier detection in separate threads in order not to put too
> much of a burden on the Spark Streaming task. So, I would like to ask:
> does this model carry any risk? Or is there an alternative provided by
> the framework so that I don't have to run a separate thread for this?
> Thank you for your attention.
> --
> Best Regards,
> Eko Susilo
