spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Spark Streaming S3 Performance Implications
Date Sat, 21 Mar 2015 16:26:59 GMT
Mike:
Once Hadoop 2.7.0 is released, you should be able to enjoy the enhanced
performance of s3a.
See HADOOP-11571
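
A rough sketch of what that switch might look like, reusing the rdd from your
saveFunc below (the bucket / prefix and credential settings are placeholders,
not anything from this thread):

    // assumes a Hadoop version that ships the s3a filesystem (2.6+, faster in 2.7)
    sc.hadoopConfiguration.set("fs.s3a.access.key", "...")  // or let credentials come from the environment
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")
    rdd.saveAsTextFile("s3a://my-bucket/some/prefix")       // hypothetical bucket and prefix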

Cheers

On Sat, Mar 21, 2015 at 8:09 AM, Chris Fregly <chris@fregly.com> wrote:

> hey mike!
>
> you'll definitely want to increase your parallelism by adding more shards
> to the stream - as well as spinning up 1 receiver per shard and unioning
> the resulting DStreams, per the KinesisWordCount example that is included
> with the kinesis streaming package.
>
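> something like this (a rough sketch along the lines of KinesisWordCount - the
> stream name, endpoint and shard count are placeholders, and it assumes you
> already have a StreamingContext ssc on your 1 second batch interval):
>
>          import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
>          import org.apache.spark.storage.StorageLevel
>          import org.apache.spark.streaming.Seconds
>          import org.apache.spark.streaming.kinesis.KinesisUtils
>
>          val numShards = 2  // placeholder - use your stream's actual shard count
>
>          // one receiver (input DStream) per shard...
>          val kinesisStreams = (0 until numShards).map { _ =>
>              KinesisUtils.createStream(ssc, "myStream", "https://kinesis.us-east-1.amazonaws.com",
>                  Seconds(1), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
>          }
>
>          // ...then union them into the single dataStream you already write out
>          val dataStream = ssc.union(kinesisStreams).map(bytes => new String(bytes, "UTF-8"))
>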
> you'll need more cores (cluster) or threads (local) to support this -
> at least the number of shards/receivers + 1.
>
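> in local mode that works out to something like this (the thread count and app
> name are only an example):
>
>          import org.apache.spark.SparkConf
>
>          // 2 shards -> 2 receivers -> at least 2 + 1 = 3 threads
>          val conf = new SparkConf().setAppName("KinesisToS3").setMaster("local[3]")
>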
> also, it looks like you're writing to S3 per RDD.  you'll want to broaden
> that out to write DStream batches - or expand even further and write
> window batches (where the window interval is a multiple of the batch
> interval).
>
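> for example (a sketch only - the 60 second window/slide below is a made-up
> value, it just has to be a multiple of your 1 second batch interval):
>
>          import org.apache.spark.streaming.Seconds
>
>          // write once per 60 second window instead of once per 1 second batch
>          dataStream.window(Seconds(60), Seconds(60)).foreachRDD { (rdd, time) =>
>              if (rdd.count() > 0) {
>                  rdd.saveAsTextFile("s3n://..." + time.milliseconds)
>              }
>          }
>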
> this goes for any spark streaming implementation - not just Kinesis.
>
> lemme know if that works for you.
>
> thanks!
>
> -Chris
> _____________________________
> From: Mike Trienis <mike.trienis@orcsol.com>
> Sent: Wednesday, March 18, 2015 2:45 PM
> Subject: Spark Streaming S3 Performance Implications
> To: <user@spark.apache.org>
>
>
>
>  Hi All,
>
>  I am pushing data from Kinesis stream to S3 using Spark Streaming and
> noticed that during testing (i.e. master=local[2]) the batches (1 second
> intervals) were falling behind the incoming data stream at about 5-10
> events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking
> a few seconds to complete.
>
>          val saveFunc = (rdd: RDD[String], time: Time) => {
>              val count = rdd.count()
>              // skip empty batches
>              if (count > 0) {
>                  val s3BucketInterval = time.milliseconds.toString
>                  rdd.saveAsTextFile("s3n://...")
>              }
>          }
>
>          dataStream.foreachRDD(saveFunc)
>
>
>  Should I expect the same behaviour in a deployed cluster? Or does the
> rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?
>
>  "Write the elements of the dataset as a text file (or set of text files)
> in a given directory in the local filesystem, HDFS or any other
> Hadoop-supported file system. Spark will call toString on each element to
> convert it to a line of text in the file."
>
>  Thanks, Mike.
>
>
>
