spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: Implementation problem with Streaming
Date Wed, 26 Mar 2014 02:12:07 GMT
Two good benefits of Streaming:
1. It maintains windows as you move across time, adding & removing RDDs as
you move through the window.
2. It connects with streaming systems like Kafka to import data as it
arrives & process it.

You don't seem to need either of these features; you would be better off
using Spark with crontab maybe :), serializing your object to HDFS if it's
huge, or maintaining it in memory.
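The crontab suggestion above could be sketched roughly as follows; the Spark install path, job jar, class name, and HDFS path are all illustrative assumptions, not from the original mail:

```shell
# Hypothetical crontab entry: launch the hourly batch job five minutes past
# each hour via spark-submit (all paths and the class name are made up).
5 * * * * /opt/spark/bin/spark-submit --class com.example.HourlyLogJob \
    /opt/jobs/hourly-log-job.jar hdfs:///logs/incoming >> /var/log/hourly-job.log 2>&1
```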

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Tue, Mar 25, 2014 at 2:04 PM, Sanjay Awatramani <sanjay_awat@yahoo.com> wrote:

> Hi,
>
> I had initially thought of a streaming approach to solve my problem, but I
> am stuck at a few places and want opinions on whether this problem is
> suitable for streaming, or whether it is better to stick to basic Spark.
>
> Problem: I get chunks of log files in a folder and need to do some
> analysis on them on an hourly interval, e.g. 11.00 to 11.59. The file
> chunks may or may not come in real time, and there can be breaks between
> subsequent chunks.
>
> pseudocode:
> while (true) {
>   CheckForFile(localFolder)
>   CopyToHDFS()
>   RDDfile = read(fileFromHDFS)
>   RDDHour = RDDHour.union(RDDfile.filter(keyHour == currentHr))
>   if (RDDHour.keys().contains(currentHr + 1)) { // next hour has come, so
>                                                 // current hour should be complete
>     RDDHour.process()
>     deleteFileFromHDFS()
>     RDDHour.empty()
>     currentHr++
>   }
> }
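The hour-bucketing loop in the quoted pseudocode can be sketched in plain Java, using an in-memory map in place of RDDs; the class and method names here are illustrative, not from the original mail:

```java
import java.util.*;

// Sketch of the per-hour batching from the pseudocode above: records are
// bucketed by the hour parsed from their content, and an hour is processed
// once a record for a *later* hour shows up.
class HourlyBatcher {
    private final Map<Integer, List<String>> buckets = new TreeMap<>();
    private final List<List<String>> processed = new ArrayList<>();
    private int currentHr;

    HourlyBatcher(int startHr) {
        this.currentHr = startHr;
    }

    // Add one record tagged with the hour taken from the log content.
    void add(int hour, String record) {
        buckets.computeIfAbsent(hour, h -> new ArrayList<>()).add(record);
        // Seeing a record for the next hour means the current hour is complete.
        while (buckets.containsKey(currentHr + 1)) {
            List<String> done = buckets.remove(currentHr);
            processed.add(done == null ? Collections.<String>emptyList() : done);
            currentHr++;
        }
    }

    int currentHour() { return currentHr; }
    List<List<String>> processedBatches() { return processed; }
}
```

Note that, like the pseudocode, this only advances when hour+1 actually arrives, so a long gap in the input stalls processing — the same limitation Sanjay raises about breaks between chunks.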
>
> If I use streaming, I face the following problems:
> 1) Inability to keep a Java variable (currentHr) in the driver which can
> be used across batches.
> 2) The input files may come with a break; e.g. 10.00 - 10.30 comes, then
> a break of 4 hours. If I use streaming, then I can't process the 10.00 -
> 10.30 batch as it's incomplete, and the 1-hour DStream window for the
> 10.30 - 11.00 file will have the previous RDD as empty, since nothing was
> received in the preceding 4 hours. Basically, Streaming takes the file's
> arrival time as input and not the time inside the file's content.
> 3) No control over deleting files from HDFS, as the program runs in a
> SparkStreamingContext loop.
>
> Any ideas on overcoming the above limitations, or on whether streaming is
> suitable for this kind of problem, would be helpful.
>
> Regards,
> Sanjay
>
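Sanjay's point (2) hinges on bucketing by the event time inside the log content rather than the file's arrival time. A minimal helper for that, assuming (hypothetically) that each line starts with a "yyyy-MM-dd HH:mm:ss" timestamp, might look like:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derive the bucketing hour from the log line itself,
// not from when the file arrived. Assumes lines begin with a timestamp like
// "2014-03-25 10:42:13 GET /index.html".
class EventHour {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static int hourOf(String logLine) {
        // The first 19 characters hold the "yyyy-MM-dd HH:mm:ss" stamp.
        LocalDateTime ts = LocalDateTime.parse(logLine.substring(0, 19), FMT);
        return ts.getHour();
    }
}
```

With hours derived this way, the driver loop can decide an hour is complete from the data itself, independent of when the file chunks happen to land.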
