spark-user mailing list archives

From Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com>
Subject Re: Spark Streaming over YARN
Date Fri, 02 Oct 2015 16:01:59 GMT
Hi,

If you need to use the Receiver-based approach, you can try this one:
https://github.com/dibbhatt/kafka-spark-consumer

It is also available as a Spark package:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer

You just need to specify the number of Receivers you want for the desired
receiving parallelism, and the rest will be taken care of by the
ReceiverLauncher.

This low-level Receiver gives better parallelism both for receiving and for
processing the resulting RDDs.
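
A rough sketch of how to wire it up from Scala is below. The package name,
property keys, and launch signature here are from memory of the project
README, so please double-check them against the repo before using:

import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import consumer.kafka.ReceiverLauncher  // package name assumed from the README, please verify

val sparkConf = new SparkConf().setAppName("kafka-receiver-example")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// ZooKeeper / topic settings the consumer expects (key names may differ between versions)
val props = new Properties()
props.put("zookeeper.hosts", "zkhost1,zkhost2")
props.put("zookeeper.port", "2181")
props.put("kafka.topic", "mytopic")
props.put("kafka.consumer.id", "my-consumer-group")

// One receiver per Kafka partition spreads the receiving load across the cluster
val numberOfReceivers = 4
val kafkaStream = ReceiverLauncher.launch(ssc, props, numberOfReceivers, StorageLevel.MEMORY_ONLY)

kafkaStream.foreachRDD { rdd => /* e.g. persist to MongoDB here */ }

ssc.start()
ssc.awaitTermination()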

The default Receiver-based API (KafkaUtils.createStream) uses the Kafka
high-level consumer API, which has serious issues when used in production.


Regards,
Dibyendu





On Fri, Oct 2, 2015 at 9:22 PM, <nibiau@free.fr> wrote:

> From my understanding, as soon as I use YARN I don't need to manage
> parallelism myself (at least for the RDD processing).
> I don't want to use the direct stream because I would have to manage the
> offset positioning myself (in order to be able to restart from the last
> offset processed after a Spark job failure).
>
>
> ----- Original Message -----
> From: "Cody Koeninger" <cody@koeninger.org>
> To: "Nicolas Biau" <nibiau@free.fr>
> Cc: "user" <user@spark.apache.org>
> Sent: Friday, October 2, 2015 17:43:41
> Subject: Re: Spark Streaming over YARN
>
>
> If you're using the receiver-based implementation and want more
> parallelism, you have to create multiple streams and union them together.
>
>
> Or use the direct stream.
>
>
> On Fri, Oct 2, 2015 at 10:40 AM, <nibiau@free.fr> wrote:
>
>
> Hello,
> I have a job receiving data from Kafka (4 partitions) and persisting the
> data into MongoDB.
> It works fine, but when I deploy it on a YARN cluster (4 nodes with 2
> cores each), only one node receives all the Kafka partitions and only one
> node runs my RDD processing (the foreach function).
> How can I force YARN to use all the nodes and cores to process the data
> (both the receivers and the RDD processing)?
>
> Tks a lot
> Nicolas
>
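
For reference, here is a minimal Spark 1.x Scala sketch of the two
alternatives Cody mentions above (several receiver-based streams unioned
together for more receiving parallelism, and the direct stream); the topic
name, ZooKeeper/broker addresses, and batch interval are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-parallelism"), Seconds(10))

// Option 1: receiver-based. One stream per receiver, unioned, then repartitioned so
// processing is spread over more cores than just the nodes running the receivers.
val numReceivers = 4                // e.g. one receiver per Kafka partition
val topicMap = Map("mytopic" -> 1)  // topic -> consumer threads per receiver
val receiverStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", topicMap)
}
val unioned = ssc.union(receiverStreams).repartition(8)
unioned.foreachRDD { rdd => /* e.g. persist to MongoDB here */ }

// Option 2: direct stream. No receivers; each Kafka partition maps to one RDD partition.
// Offsets for restart-after-failure can be recovered via checkpointing, or read from each
// RDD (HasOffsetRanges) and stored yourself.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))
directStream.foreachRDD { rdd => /* e.g. persist to MongoDB here */ }

ssc.start()
ssc.awaitTermination()

In practice you would pick only one of the two options; they are shown
together here just for comparison.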
