spark-user mailing list archives

From Sanjay Awatramani <sanjay_a...@yahoo.com>
Subject Implementation problem with Streaming
Date Tue, 25 Mar 2014 18:04:43 GMT
Hi,

I had initially thought of a streaming approach to solve my problem, but I am stuck at a few
places and would like opinions on whether this problem is suited to streaming, or whether it
is better to stick to basic Spark.

Problem: I receive chunks of log files in a folder and need to run some analysis on them per
hourly interval, e.g. 11.00 to 11.59. The file chunks may or may not arrive in real time, and
there can be breaks between subsequent chunks.

pseudocode:
while(true)
{
  CheckForFile(localFolder)
  CopyToHDFS()
  RDDfile = read(fileFromHDFS)
  RDDHour = RDDHour.union(RDDfile.filter(keyHour == currentHr))
  if(RDDfile.keys().contains(currentHr + 1)) // next hour has come, so current hour should be complete
  {
      RDDHour.process()
      deleteFileFromHDFS()
      RDDHour.empty()
      currentHr++
  }
}
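
For concreteness, here is a minimal sketch of that loop against the plain Spark Java API. The
helpers (checkForFile, copyToHdfs, process, deleteFromHdfs), the paths, and the assumption that
each log line starts with its hour as the first comma-separated field are placeholders of mine,
not anything Spark provides:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class HourlyBatchLoop {

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("HourlyBatchLoop").setMaster("local[2]"));

        int currentHr = 10; // the hour currently being accumulated
        JavaPairRDD<Integer, String> hourRdd =
                sc.parallelizePairs(new ArrayList<Tuple2<Integer, String>>());

        while (true) {
            // Poll the local folder and stage the next chunk on HDFS (hypothetical helpers).
            String hdfsPath = copyToHdfs(checkForFile("/local/folder"));

            // Key each line by its hour; assumes the hour is the first comma-separated field.
            JavaPairRDD<Integer, String> fileRdd = sc.textFile(hdfsPath).mapToPair(
                    line -> new Tuple2<>(Integer.parseInt(line.split(",")[0]), line));

            final int hr = currentHr; // lambdas need an effectively final copy
            hourRdd = hourRdd.union(fileRdd.filter(t -> t._1() == hr));

            // A record stamped with the next hour means the current hour is complete.
            List<Integer> hoursSeen = fileRdd.keys().distinct().collect();
            if (hoursSeen.contains(hr + 1)) {
                process(hourRdd);         // the hourly analysis, left abstract here
                deleteFromHdfs(hdfsPath);
                hourRdd = sc.parallelizePairs(new ArrayList<Tuple2<Integer, String>>());
                currentHr++;
            }
        }
    }

    // Stubs standing in for the environment-specific pieces of the pseudocode.
    private static String checkForFile(String localFolder) { return localFolder + "/chunk.log"; }
    private static String copyToHdfs(String localFile) { return "hdfs:///logs/chunk.log"; }
    private static void process(JavaPairRDD<Integer, String> rdd) { System.out.println(rdd.count()); }
    private static void deleteFromHdfs(String path) { /* e.g. via org.apache.hadoop.fs.FileSystem */ }
}

(In a long-running loop like this, the repeated union() grows the RDD lineage, so periodically
persisting or checkpointing hourRdd would likely be needed in practice.)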

If I use streaming, I face the following problems:
1) There is no way to keep a Java variable (currentHr) in the driver that persists across
batches (see the sketch after this list).
2) The input files may arrive with gaps, e.g. the 10.00 - 10.30 chunk arrives, then nothing for
4 hours. With streaming I cannot process the 10.00 - 10.30 batch because it is incomplete, and
the 1-hour DStream window for the 10.30 - 11.00 file will see an empty previous RDD because
nothing was received in the preceding 4 hours. Basically, streaming batches on file arrival
time, not on the timestamps inside the file contents.
3) There is no control over deleting files from HDFS, since the program runs inside a
StreamingContext loop.
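
On point 1, Spark Streaming's updateStateByKey can stand in for a driver-side variable by
keeping per-key state on the cluster across batches. A minimal sketch, assuming every record is
mapped to one shared key and the state is the highest hour stamp seen so far (the key name,
paths, and update rule are my assumptions, not part of the API):

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class HourTracker {

    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext jssc = new JavaStreamingContext(
                new SparkConf().setAppName("HourTracker").setMaster("local[2]"),
                Durations.minutes(1));
        jssc.checkpoint("hdfs:///checkpoints"); // required by updateStateByKey

        // Monitor an HDFS directory for new chunks; each line's hour is its first field.
        JavaDStream<String> lines = jssc.textFileStream("hdfs:///incoming");

        // Map every record to a single shared key so one state cell tracks the hour.
        JavaPairDStream<String, Integer> hours = lines.mapToPair(
                line -> new Tuple2<>("currentHr", Integer.parseInt(line.split(",")[0])));

        // State update: carry the maximum hour seen so far into the next batch.
        JavaPairDStream<String, Integer> maxHourSeen = hours.updateStateByKey(
                (List<Integer> newHours, Optional<Integer> state) -> {
                    int max = state.orElse(0);
                    for (int h : newHours) max = Math.max(max, h);
                    return Optional.of(max);
                });

        maxHourSeen.print();
        jssc.start();
        jssc.awaitTermination();
    }
}

Even so, this does not touch points 2 and 3: textFileStream batches by arrival time, not by the
timestamps inside the files, so the plain batch loop sketched earlier may remain the simpler fit.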

Any ideas on overcoming the above limitations, or on whether streaming is suitable for this
kind of problem at all, would be helpful.

Regards,
Sanjay