spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Syed, Nehal (Contractor)" <>
Subject Re: Kafka Spark Partition Mapping
Date Mon, 24 Aug 2015 17:21:08 GMT
Dear Cody,
Thanks for your response. I am trying to do decoration, meaning that when a message comes from
Kafka (partitioned by key) into Spark, I want to add more fields/data to it.
How do people normally do this in Spark? If it were you, how would you decorate messages without
hitting the database for every message?

Our current strategy is that decoration data comes from a local in-memory cache (a Guava LoadingCache),
falling back to the SQL DB when a key is not in the cache.  If we take this approach, we want the cached
decoration data to be available locally to the RDDs most of the time.
Our Kafka and Spark run on separate machines, and that's why I want messages from a Kafka partition to
go to the same Spark RDD partition most of the time, so I can utilize the cached decoration data.
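
(For illustration, a minimal sketch of that strategy in Java -- the class name DecorationCache and
the lookupFromSqlDb call are hypothetical. The cache lives in a static field so each executor JVM
builds it once, and the DB is hit only on a cache miss:)

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    public final class DecorationCache {
        // One cache per executor JVM, built lazily on first use.
        private static LoadingCache<String, String> cache;

        public static synchronized LoadingCache<String, String> get() {
            if (cache == null) {
                cache = CacheBuilder.newBuilder()
                    .maximumSize(100_000)
                    .expireAfterWrite(10, TimeUnit.MINUTES)
                    .build(new CacheLoader<String, String>() {
                        @Override
                        public String load(String key) throws Exception {
                            return lookupFromSqlDb(key); // hypothetical JDBC lookup
                        }
                    });
            }
            return cache;
        }

        private static String lookupFromSqlDb(String key) {
            // ... query the SQL DB; reached only on a cache miss ...
            return "decoration-for-" + key;
        }
    }

Decoration then happens per message, e.g.:

    JavaPairDStream<String, String> decorated = directKafkaStream.mapToPair(kv ->
        new Tuple2<>(kv._1(), kv._2() + "|" + DecorationCache.get().get(kv._1())));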

Do you think that if I create a JdbcRDD for the decoration data and join it with the
JavaPairInputDStream, it will always stay where the JdbcRDD lives?
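
(For reference, a sketch of what that join could look like -- the SQL, bounds, and connection string
are hypothetical, and jsc is an existing JavaSparkContext. Note that a join shuffles both sides, so
it does not by itself pin the result to the nodes where the JdbcRDD partitions live:)

    import java.sql.DriverManager;
    import java.sql.ResultSet;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.rdd.JdbcRDD;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    JavaRDD<Object[]> rows = JdbcRDD.create(
        jsc,                                            // JavaSparkContext
        () -> DriverManager.getConnection("jdbc:..."),  // JdbcRDD.ConnectionFactory
        "SELECT msg_key, extra FROM decoration WHERE id >= ? AND id <= ?",
        1L, 1000000L, 4,                                // lower/upper bound, partitions
        (ResultSet rs) -> new Object[] { rs.getString(1), rs.getString(2) });

    JavaPairRDD<String, String> decoration = rows
        .mapToPair(r -> new Tuple2<>((String) r[0], (String) r[1]))
        .cache();

    // Join each micro-batch of the stream against the decoration RDD.
    JavaPairDStream<String, Tuple2<String, String>> decorated =
        directKafkaStream.transformToPair(rdd -> rdd.join(decoration));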


From: Cody Koeninger <<>>
Date: Thursday, August 20, 2015 at 6:33 PM
To: Microsoft Office User <<>>
Cc: "<>" <<>>
Subject: Re: Kafka Spark Partition Mapping

In general you cannot guarantee which node an RDD will be processed on.

The preferred location for a KafkaRDD partition is the Kafka leader for that partition, if Kafka and Spark
are deployed on the same machines. If you want to try to override that behavior, the method is getPreferredLocations

But even in that case, location preferences are just a scheduler hint; the RDD can still be
scheduled elsewhere.  You can turn up spark.locality.wait to a very high value to decrease
the likelihood.
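
(For reference, a sketch of turning that knob -- the value here is arbitrary; the default is 3s:)

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .setAppName("kafka-decoration")
        // Wait longer for a node-local slot before giving up and
        // scheduling the task on a non-preferred node.
        .set("spark.locality.wait", "30s");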

On Thu, Aug 20, 2015 at 5:47 PM, nehalsyed <<>> wrote:

I have data in a Kafka topic-partition and I am reading it from Spark like this:

    JavaPairInputDStream<String, String> directKafkaStream =
        KafkaUtils.createDirectStream(streamingContext,
            [key class], [value class], [key decoder class], [value decoder class],
            [map of Kafka parameters], [set of topics to consume]);

I want messages from a Kafka partition to always land on the same machine in the Spark RDD, so I can
cache some decoration data locally and later reuse it with other messages (that belong to the same key).
Can anyone tell me how I can achieve this? Thanks
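
(For concreteness, a sketch of how those placeholders are typically filled in with the Kafka 0.8
direct API -- the broker address and topic name below are hypothetical:)

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");  // hypothetical broker

    Set<String> topics = new HashSet<>();
    topics.add("events");                                     // hypothetical topic

    JavaPairInputDStream<String, String> directKafkaStream =
        KafkaUtils.createDirectStream(streamingContext,
            String.class, String.class,
            StringDecoder.class, StringDecoder.class,
            kafkaParams, topics);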
