spark-user mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: spark streaming doubt
Date Mon, 13 Jul 2015 13:52:32 GMT
Regarding your first question, having more partitions than you do executors
usually means you'll have better utilization, because the workload will be
distributed more evenly.  There's some degree of per-task overhead, but as
long as you don't have a huge imbalance between number of tasks and number
of executors that shouldn't be a large problem.
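
A minimal sketch to make that concrete (assuming the Spark 1.3 Java API,
with directKafkaStream being a stream like the one in the quoted question
below): the direct stream gives each batch RDD one partition per Kafka
partition, so with 300 partitions and 30 task slots (10 executors * 3
cores) each batch runs as roughly 10 waves of 30 concurrent tasks.

    import java.util.List;

    import org.apache.spark.Partition;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;

    // Log the partition count of each batch RDD; with the direct stream
    // this matches the topic's Kafka partition count.
    directKafkaStream.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
      public Void call(JavaPairRDD<String, String> rdd) {
        List<Partition> parts = rdd.partitions();
        System.out.println("Partitions in this batch: " + parts.size());
        return null;
      }
    });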

I don't really understand your second question.

On Sat, Jul 11, 2015 at 5:00 AM, Shushant Arora <shushantarora09@gmail.com>
wrote:

> 1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka
> partitions in the topic. Say I have 300 partitions in the topic and 10
> executors, each with 3 cores. Since executors launch one task per RDD
> partition, I need 300 tasks in total, but with only 30 cores (10 executors
> with 3 cores each) does that mean only 10*3 = 30 partitions are processed
> at a time, so the tasks execute 30 at a time until all 300 are done?
>
> So would reducing the number of Kafka partitions to, say, 100 speed up the
> processing?
>
> 2. In my Spark Streaming job I process the Kafka stream using foreachRDD:
>
> directKafkaStream.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
>     public Void call(JavaPairRDD<String, String> rdd) {
>         rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {
>             public void call(Iterator<Tuple2<String, String>> partition) {
>                 // ..process partition
>             }
>         });
>         return null;
>     }
> });
>
> Since foreachRDD is an output operation it spawns a Spark job, but the
> timings for these jobs are not printed on the driver console the way they
> are with map and print, i.e.:
>
> -------------------------------------------
> Time: 1429054870000 ms
> -------------------------------------------
> -------------------------------------------
> Time: 1429054871000 ms
> -------------------------------------------
>
> ..................
>
> Why is it so?
>
>
> Thanks
> Shushant
