spark-user mailing list archives

From nib...@free.fr
Subject Re: Spark Streaming over YARN
Date Fri, 02 Oct 2015 16:10:31 GMT
OK, so if I set for example 4 receivers (one per node), how will the RDDs be distributed over
the nodes/cores?
In my case I have 4 nodes (with 2 cores each).

Tks
Nicolas 


----- Original Message -----
From: "Dibyendu Bhattacharya" <dibyendu.bhattachary@gmail.com>
To: nibiau@free.fr
Cc: "Cody Koeninger" <cody@koeninger.org>, "user" <user@spark.apache.org>
Sent: Friday, October 2, 2015 18:01:59
Subject: Re: Spark Streaming over YARN


Hi, 


If you need to use the receiver-based approach, you can try this one: https://github.com/dibbhatt/kafka-spark-consumer

This is also part of Spark Packages: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer

You just need to specify the number of receivers you want for the desired receiving parallelism,
and the rest will be taken care of by ReceiverLauncher.
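
For reference, a minimal sketch of how this consumer is typically wired up, adapted from the project's README of that era; the package name, property keys, and launch signature are assumptions to verify against the repository, and the host/topic values are placeholders:

    import java.util.Properties
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import consumer.kafka.ReceiverLauncher

    val ssc = new StreamingContext(
      new SparkConf().setAppName("KafkaReceiverExample"), Seconds(10))

    // Kafka/ZooKeeper connection details (placeholder values)
    val props = new Properties()
    props.put("zookeeper.hosts", "zkhost1,zkhost2")
    props.put("zookeeper.port", "2181")
    props.put("kafka.topic", "mytopic")
    props.put("kafka.consumer.id", "my-consumer-group")

    // One receiver per node on the 4-node cluster discussed in this thread;
    // ReceiverLauncher spreads the topic's partitions across these receivers.
    val numberOfReceivers = 4
    val stream = ReceiverLauncher.launch(
      ssc, props, numberOfReceivers, StorageLevel.MEMORY_ONLY)

    stream.foreachRDD(rdd => rdd.foreach(println))
    ssc.start()
    ssc.awaitTermination()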


This low-level receiver gives better parallelism both for receiving and for processing
the resulting RDDs.


The default receiver-based API (KafkaUtils.createStream) uses the Kafka high-level consumer API,
which has serious issues when used in production.

Regards, 

Dibyendu

On Fri, Oct 2, 2015 at 9:22 PM, <nibiau@free.fr> wrote:


From my understanding, as soon as I use YARN I don't need to configure parallelism myself
(at least for the RDD processing).
I don't want to use the direct stream because I would have to manage the offset positions myself
(in order to be able to restart from the last processed offset after a Spark job failure).
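
For context on the offset concern: the direct stream exposes the offset ranges of every batch, so they can be persisted externally and used to resume after a failure. A minimal sketch against the Spark 1.x Kafka direct API, where saveOffsets is a hypothetical helper standing in for whatever store (ZooKeeper, MongoDB, ...) is used:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("DirectStreamOffsets"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
    val topics = Set("mytopic")                                     // placeholder

    // Hypothetical helper: persist topic/partition/untilOffset to an external store.
    def saveOffsets(ranges: Array[OffsetRange]): Unit =
      ranges.foreach(r => println(s"${r.topic}/${r.partition} -> ${r.untilOffset}"))

    val directStream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    directStream.foreachRDD { rdd =>
      // Each RDD from the direct stream carries its Kafka offset ranges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Persist the data first, then record the offsets, so a restart
      // resumes from the last successfully processed position.
      rdd.foreachPartition(_.foreach(record => ())) // stand-in for the real write
      saveOffsets(offsetRanges)
    }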


----- Original Message -----
From: "Cody Koeninger" <cody@koeninger.org>
To: "Nicolas Biau" <nibiau@free.fr>
Cc: "user" <user@spark.apache.org>
Sent: Friday, October 2, 2015 17:43:41
Subject: Re: Spark Streaming over YARN

If you're using the receiver-based implementation and want more parallelism, you have to
create multiple streams and union them together.
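
A minimal sketch of that pattern with the Spark 1.x receiver-based API, assuming the 4-partition topic and 4-node cluster described below; the ZooKeeper quorum and group id are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("UnionedReceivers"), Seconds(10))

    val zkQuorum = "zkhost1:2181"      // placeholder ZooKeeper quorum
    val groupId = "my-consumer-group"  // placeholder consumer group
    val topicMap = Map("mytopic" -> 1) // 1 consumer thread per stream

    // One stream (and thus one receiver, pinned to one core) per Kafka partition.
    val numStreams = 4
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    }

    // Union them so downstream processing sees a single DStream whose
    // partitions come from all four receivers.
    val unified = ssc.union(kafkaStreams)
    unified.foreachRDD(rdd => rdd.foreach(println))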


Or use the direct stream. 


On Fri, Oct 2, 2015 at 10:40 AM, <nibiau@free.fr> wrote:


Hello, 
I have a job receiving data from Kafka (4 partitions) and persisting the data into MongoDB.
It works fine, but when I deploy it on a YARN cluster (4 nodes with 2 cores each), only one node
receives all the Kafka partitions and only one node runs my RDD processing (the foreach function).
How can I force YARN to use all the nodes and cores to process the data (both the receiving and
the RDD processing)?

Tks a lot 
Nicolas 
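
On the resource side of this question, the number of executors and cores per executor can be requested explicitly when running on YARN. A short sketch, assuming the 4-node / 2-core cluster above; note this only makes the resources available, while spreading the receiving itself still needs multiple receivers, as discussed earlier in the thread:

    import org.apache.spark.SparkConf

    // Equivalent to spark-submit --num-executors 4 --executor-cores 2
    val conf = new SparkConf()
      .setAppName("KafkaToMongo")            // hypothetical app name
      .set("spark.executor.instances", "4")  // one executor per node
      .set("spark.executor.cores", "2")      // use both cores on each node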

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org