spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tathagata Das <t...@databricks.com>
Subject Re: spark streaming 1.3 coalesce on kafkadirectstream
Date Wed, 22 Jul 2015 06:09:57 GMT
With DirectKafkaStream there are two approaches.
1. you increase the number of KAfka partitions Spark will automatically
read in parallel
2. if that's not possible, then explicitly repartition only if there are
more cores in the cluster than the number of Kafka partitions, AND the
first map-like state on the directKafkaDStream is heavy enough to warrant
the cost of repartitioning (shuffles the data around).

On Mon, Jul 20, 2015 at 8:31 PM, Shushant Arora <shushantarora09@gmail.com>
wrote:

> does spark streaming 1.3 launches task for each partition offset range
> whether that is 0 or not ?
>
> If yes, how can I enforce it to not to launch tasks for empty rdds.Not
> able t o use coalesce on directKafkaStream.
>
> Shall we enforce repartitioning always before processing direct stream ?
>
> use case is :
>
> directKafkaStream.repartition(numexecutors).mapPartitions(new
> FlatMapFunction<Iterator<Tuple2<byte[],byte[]>>, String>(){
> ...
> }
>
> Thanks
>

Mime
View raw message