spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: Has anybody ever tried running Spark Streaming on 500 text streams?
Date Tue, 28 Jul 2015 22:42:17 GMT
I don't think anyone has really run 500 text streams.
Also, the parSequences do nothing there: you are only parallelizing the setup
code, which does not compute anything. It also sets up 500 foreachRDD
operations that will get executed sequentially in each batch, so it does not
make sense. The right way to parallelize this is to union all the streams.

val streams = streamPaths.map { path =>
  ssc.textFileStream(path)
}
val unionedStream = ssc.union(streams)
unionedStream.foreachRDD { rdd =>
  // do something
}

Then only one foreachRDD is executed per batch, and it processes all the new
files from every directory in parallel within each batch interval.
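For completeness, a minimal end-to-end sketch of that pattern (the S3 paths,
app name, and batch interval here are hypothetical placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object UnionedFileStreams {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UnionedFileStreams")
    val ssc = new StreamingContext(conf, Minutes(10))

    // Hypothetical list of directories to watch; in practice these would
    // be the 500 S3 paths.
    val streamPaths = Seq("s3n://bucket/dir1", "s3n://bucket/dir2")

    // One input DStream per directory (plain map, no .par needed).
    val streams = streamPaths.map(path => ssc.textFileStream(path))

    // Union into a single DStream so each batch runs one job over
    // all newly arrived files.
    val unionedStream = ssc.union(streams)

    unionedStream.foreachRDD { rdd =>
      // do something with the combined RDD of new lines
      rdd.foreach(line => println(line))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that the parallelism then comes from Spark itself: the single job per
batch is split into tasks across the files' partitions, rather than from
driver-side parallel collections.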
TD


On Tue, Jul 28, 2015 at 3:06 PM, Brandon White <bwwinthehouse@gmail.com>
wrote:

> val ssc = new StreamingContext(sc, Minutes(10))
>
> //500 textFile streams watching S3 directories
> val streams = streamPaths.par.map { path =>
>   ssc.textFileStream(path)
> }
>
> streams.par.foreach { stream =>
>   stream.foreachRDD { rdd =>
>     //do something
>   }
> }
>
> ssc.start()
>
> Would something like this scale? What would be the limiting factor to
> performance? What is the best way to parallelize this? Any other ideas on
> design?
>
