spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Davidson <ilike...@gmail.com>
Subject Re: Spark Streaming - How to control the parallelism like storm
Date Tue, 22 Oct 2013 19:04:27 GMT
As Mark said, flatMap can only parallelize into as many partitions as exist
in the incoming RDD. socketTextStream() only produces 1 RDD at a time.
However, you can utilize the RDD.coalesce() method to split one RDD into
multiple (excuse the name; it can be used for shrinking or growing the
number of partitions), like so:

val lines = ssc.socketTextStream(args(1), args(2).toInt)
val partitionedLines = stream.transform(rdd => rdd.coalesce(10, shuffle =
true))
val words = partitionedLines.flatMap(_.split(" "))
...

This splits the incoming text stream into 10 partitions, so flatMap can run
up to 10x faster, assuming you have that many worker threads (and ignoring
the increased latency in partitioning the rdd across your nodes).


On Tue, Oct 22, 2013 at 8:21 AM, Mark Hamstra <mark@clearstorydata.com>wrote:

> Not separately at the level of `flatMap` and `map`.  The number of
> partitions in the RDD those operations are working on determines the
> potential parallelism.  The number of worker cores available determines how
> much of that potential can be actualized.
>
>
> On Tue, Oct 22, 2013 at 7:24 AM, Ryan Chan <ryanchan404@gmail.com> wrote:
>
>> In storm, you can control the number of thread with the setSpout/setBolt,
>> and how to do the same with Spark Streaming?
>>
>> e.g.
>>
>> val lines = ssc.socketTextStream(args(1), args(2).toInt)
>> val words = lines.flatMap(_.split(" "))
>> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
>> wordCounts.print()
>> ssc.start()
>>
>>
>> Sound like I cannot tell Spark to tell how many thread to be used with
>> `flatMap` and how many thread to be used with `map` etc, right?
>>
>>
>>
>

Mime
View raw message