spark-user mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: Spark Streaming - How to control the parallelism like storm
Date Wed, 23 Oct 2013 06:25:29 GMT
This is something we should add directly to the streaming API rather than
requiring a transform() call. In fact, it's one of the really nice things
about Spark Streaming: you can dynamically change the parallelism at any
point in the stream declaration by coalescing the input stream. This is
especially useful for streams, because they may be collected on a single
node while the heavy lifting requires the resources of the entire cluster.
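
Concretely, a sketch of the transform-based approach, alongside what a direct
streaming API could look like (the `repartition` call on a DStream is an
assumption here, not a method this thread confirms exists; host and port are
placeholders):

```scala
// Sketch: assumes an already-constructed StreamingContext `ssc`;
// "localhost" and 9999 are placeholder values.
val lines = ssc.socketTextStream("localhost", 9999)

// Today: spread each received batch across 10 partitions via transform().
val spread = lines.transform(rdd => rdd.coalesce(10, shuffle = true))

// Hypothetical direct API (what "add directly to the streaming API" might be):
// val spread = lines.repartition(10)
```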

- Patrick

On Tue, Oct 22, 2013 at 12:17 PM, Aaron Davidson <ilikerps@gmail.com> wrote:

>
>
> ---------- Forwarded message ----------
> From: Aaron Davidson <ilikerps@gmail.com>
> Date: Tue, Oct 22, 2013 at 12:04 PM
> Subject: Re: Spark Streaming - How to control the parallelism like storm
> To: user@spark.incubator.apache.org
>
>
> As Mark said, flatMap can only parallelize into as many partitions as
> exist in the incoming RDD. socketTextStream() only produces 1 RDD at a
> time. However, you can use the RDD.coalesce() method to redistribute one RDD
> across more partitions (excuse the name; it can shrink or grow the number of
> partitions), like so:
>
> val lines = ssc.socketTextStream(args(1), args(2).toInt)
> val partitionedLines = lines.transform(rdd => rdd.coalesce(10, shuffle = true))
> val words = partitionedLines.flatMap(_.split(" "))
> ...
>
> This splits the incoming text stream into 10 partitions, so flatMap can
> run up to 10x faster, assuming you have that many worker threads (and
> ignoring the increased latency in partitioning the rdd across your nodes).
>
>
> On Tue, Oct 22, 2013 at 8:21 AM, Mark Hamstra <mark@clearstorydata.com> wrote:
>
>> Not separately at the level of `flatMap` and `map`.  The number of
>> partitions in the RDD those operations are working on determines the
>> potential parallelism.  The number of worker cores available determines how
>> much of that potential can be actualized.
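>>
>> For example, with plain (non-streaming) Spark, assuming a SparkContext `sc`:
>>
>> val rdd = sc.parallelize(1 to 1000, 4)       // 4 partitions: at most 4 parallel tasks
>> val grown = rdd.coalesce(16, shuffle = true) // 16 partitions: potential parallelism of 16
>> println(grown.partitions.size)               // prints 16
>>
>> With only 8 worker cores available, those 16 tasks would still run at most
>> 8 at a time.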
>>
>>
>> On Tue, Oct 22, 2013 at 7:24 AM, Ryan Chan <ryanchan404@gmail.com> wrote:
>>
>>> In Storm, you can control the number of threads with setSpout/setBolt.
>>> How do you do the same with Spark Streaming?
>>>
>>> e.g.
>>>
>>> val lines = ssc.socketTextStream(args(1), args(2).toInt)
>>> val words = lines.flatMap(_.split(" "))
>>> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
>>> wordCounts.print()
>>> ssc.start()
>>>
>>>
>>> It sounds like I cannot tell Spark how many threads to use for `flatMap`
>>> and how many threads to use for `map`, right?
>>>
>>>
>>>
>>
>
>
