spark-user mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: Kafka + Spark streaming, RDD partitions not processed in parallel
Date Fri, 11 Mar 2016 15:12:10 GMT
Can you post your actual code?

On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta <mukul.gupta@aricent.com> wrote:
> Hi All,
>
> I was running the following test.
>
> Setup:
> 9 VMs running Spark workers with 1 Spark executor each, and 1 VM running
> Kafka and the Spark master. Spark version is 1.6.0; Kafka version is
> 0.9.0.1. Spark is using its own resource manager and is not running over
> YARN.
>
> Test:
> I created a Kafka topic with 3 partitions. Next, I used
> "KafkaUtils.createDirectStream" to get a DStream:
>
>   JavaPairInputDStream<String, String> stream =
>       KafkaUtils.createDirectStream(…);
>   JavaDStream stream1 = stream.map(func1);
>   stream1.print();
>
> where func1 just contains a sleep followed by returning the value.
>
> Observation:
> First, the RDD partition corresponding to partition 1 of Kafka was
> processed on one of the Spark executors. Once that processing finished,
> the RDD partitions corresponding to the remaining two Kafka partitions
> were processed in parallel on different Spark executors.
>
> I expected all three RDD partitions to be processed in parallel, since
> there were idle Spark executors available.
>
> I re-ran the test after increasing the number of partitions of the Kafka
> topic to 5. This time, too, the RDD partition corresponding to partition 1
> of Kafka was processed first on one of the Spark executors. Once
> processing finished for this RDD partition, the RDD partitions
> corresponding to the remaining four Kafka partitions were processed in
> parallel on different Spark executors.
>
> It is not clear to me why Spark waits for the operations on the first RDD
> partition to finish when it could process the remaining partitions in
> parallel. Am I missing any configuration?
>
> Any help is appreciated.
>
> Thanks,
> Mukul
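
For reference, the expectation stated in the quoted message (all partitions handled at once when idle executors exist) can be illustrated without Spark or Kafka. What follows is a minimal plain-Java sketch, not the poster's code: the class name ParallelPartitionsSketch, the method processAll, the thread pool standing in for Spark executors, and the sleep durations are all assumptions made for illustration. Only func1 follows the message's description (sleep, then return the value).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the test in the message: no Spark, no Kafka.
public class ParallelPartitionsSketch {

    // Mirrors the described func1: sleep, then return the value unchanged.
    static String func1(String value, long sleepMillis) throws InterruptedException {
        Thread.sleep(sleepMillis);
        return value;
    }

    // Runs func1 once per "partition" on a pool of `executors` threads and
    // returns the elapsed wall-clock time in milliseconds.
    static long processAll(List<String> partitions, int executors, long sleepMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(executors);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String partition : partitions) {
                tasks.add(() -> func1(partition, sleepMillis));
            }
            long start = System.nanoTime();
            // invokeAll submits every task and returns once all have completed.
            for (Future<String> result : pool.invokeAll(tasks)) {
                result.get(); // surface any exception thrown inside a task
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> partitions = Arrays.asList("partition-0", "partition-1", "partition-2");
        // 9 threads play the role of the 9 idle executors; 1 thread is the serial baseline.
        System.out.println("9 workers: " + processAll(partitions, 9, 500) + " ms");
        System.out.println("1 worker:  " + processAll(partitions, 1, 500) + " ms");
    }
}
```

With three partitions and nine workers, the elapsed time should be close to a single sleep. The observation in the message instead amounts to roughly two sleeps per batch, one for the first partition and one for the rest.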

