spark-user mailing list archives

From Chris Fregly <ch...@fregly.com>
Subject Re: Spark Streaming S3 Performance Implications
Date Sat, 21 Mar 2015 15:09:28 GMT
hey mike!
you'll definitely want to increase your parallelism by adding more shards to the stream -
as well as spinning up 1 receiver per shard and unioning all the shards per the KinesisWordCount
example that is included with the kinesis streaming package. 
you'll need more cores (cluster) or threads (local) to support this - equalling at least the
number of shards/receivers + 1.
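
a rough sketch of the one-receiver-per-shard pattern, in case it helps. the names `ShardUnion`, `minCores`, `unionShardStreams` and `createShardStream` are made up for illustration; `createShardStream` stands in for whatever call builds one receiver-backed DStream (for Kinesis, the stream factory used in the KinesisWordCount example). only `StreamingContext.union` is real Spark API here.

```scala
import scala.reflect.ClassTag

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object ShardUnion {
  // Minimum cores (cluster) or threads (local): one per receiver,
  // plus at least one left over to process the received data.
  def minCores(numShards: Int): Int = numShards + 1

  // createShardStream is a hypothetical factory supplied by the caller.
  // Each call is assumed to register one new receiver on the context.
  def unionShardStreams[T: ClassTag](
      ssc: StreamingContext,
      numShards: Int
  )(createShardStream: StreamingContext => DStream[T]): DStream[T] = {
    // One receiver-backed stream per shard.
    val shardStreams = (0 until numShards).map(_ => createShardStream(ssc))
    // Merge them into a single DStream for downstream processing.
    ssc.union(shardStreams)
  }
}
```

with 4 shards, for example, `ShardUnion.minCores(4)` gives 5, so local testing would need at least `local[5]` rather than `local[2]`.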
also, it looks like you're writing to S3 per RDD.  you'll want to broaden that out to write
DStream batches - or expand  even further and write window batches (where the window interval
is a multiple of the batch interval).
this goes for any spark streaming implementation - not just Kinesis.
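
here's a minimal sketch of the window-batch idea, under a few assumptions: `WindowedS3Writer`, `saveWindows`, `batchesPerWindow` and `outputPrefix` are hypothetical names, and the output path layout (one directory per window, named by the window's timestamp) is just one possible choice. it uses a non-overlapping window whose length is a whole multiple of the batch interval, so each record is written once.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, Time}
import org.apache.spark.streaming.dstream.DStream

object WindowedS3Writer {
  // outputPrefix is a placeholder for the caller's own bucket/path prefix.
  def saveWindows(
      lines: DStream[String],
      batchInterval: Duration,
      batchesPerWindow: Int,
      outputPrefix: String
  ): Unit = {
    require(batchesPerWindow >= 1, "window must span at least one batch")

    // Window interval is a multiple of the batch interval.
    val windowInterval: Duration = batchInterval * batchesPerWindow

    // Window length == slide interval, so windows do not overlap.
    lines.window(windowInterval, windowInterval).foreachRDD {
      (rdd: RDD[String], time: Time) =>
        // Skip empty windows without running a full count().
        if (rdd.take(1).nonEmpty) {
          // One output directory per window, keyed by the window's end time.
          rdd.saveAsTextFile(s"$outputPrefix/${time.milliseconds}")
        }
    }
  }
}
```

with a 1 second batch interval and `batchesPerWindow = 10`, this would issue one S3 write every 10 seconds instead of one per second.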
lemme know if that works for you.
thanks!
-Chris 
    _____________________________
From: Mike Trienis <mike.trienis@orcsol.com>
Sent: Wednesday, March 18, 2015 2:45 PM
Subject: Spark Streaming S3 Performance Implications
To:  <user@spark.apache.org>


Hi All,

I am pushing data from a Kinesis stream to S3 using Spark Streaming and noticed that
during testing (i.e. master=local[2]) the batches (1 second intervals) were falling behind
the incoming data stream at about 5-10 events / second. It seems that the rdd.saveAsTextFile(s3n://...)
is taking a few seconds to complete.

    val saveFunc = (rdd: RDD[String], time: Time) => {
      val count = rdd.count()
      if (count > 0) {
        val s3BucketInterval = time.milliseconds.toString
        rdd.saveAsTextFile(s3n://...)
      }
    }

    dataStream.foreachRDD(saveFunc)

Should I expect the same behaviour in a deployed cluster? Or does the rdd.saveAsTextFile(s3n://...)
distribute the push work to each worker node?

"Write the elements of the dataset as a text file (or set of text files) in a given
directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will
call toString on each element to convert it to a line of text in the file."

Thanks, Mike.