spark-user mailing list archives

From Nicolae Marasoiu <nicolae.maras...@adswizz.com>
Subject Re: Kafka Direct Stream
Date Thu, 01 Oct 2015 19:43:50 GMT
Hi,


If you just need per-topic processing, why not create N different Kafka direct streams? When creating a Kafka direct stream you pass a list of topics - just give it a single one.


Then the reusable parts of your computation can be extracted as transformations/functions and shared between the streams.
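
A minimal sketch of this, assuming an existing StreamingContext (ssc) and a kafkaParams map are already in scope; the topic names are made up:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

val topics = Seq("clicks", "impressions")

// One direct stream per topic, keyed by topic name.
val streams: Map[String, DStream[(String, String)]] = topics.map { t =>
  t -> KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
         ssc, kafkaParams, Set(t))
}.toMap

// The shared part of the computation lives in a plain function every stream reuses.
def normalize(s: DStream[(String, String)]): DStream[String] =
  s.map { case (_, value) => value.trim }

normalize(streams("clicks")).print()        // per-topic processing diverges from here
normalize(streams("impressions")).print()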


Nicu


________________________________
From: Adrian Tanase <atanase@adobe.com>
Sent: Thursday, October 1, 2015 5:47 PM
To: Cody Koeninger; Udit Mehta
Cc: user
Subject: Re: Kafka Direct Stream

On top of that you could make the topic part of the key (e.g. keyBy in .transform or manually
emitting a tuple) and use one of the .xxxByKey operators for the processing.
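
A rough sketch of that, assuming `stream` is the (key, value) DStream returned by the direct API; the tagging step uses Cody's offset-range tip from his message below:

import org.apache.spark.streaming.kafka.HasOffsetRanges

// Tag every record with its topic inside .transform - offsetRanges is only
// available on the RDDs coming straight out of the direct stream, before any shuffle.
val tagged = stream.transform { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = offsetRanges(i).topic  // RDD partition i maps to offsetRanges(i)
    iter.map { case (_, value) => (topic, value) }
  }
}

// Any of the xxxByKey operators then groups by topic, e.g. a per-topic count:
tagged.map { case (topic, _) => (topic, 1L) }.reduceByKey(_ + _).print()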

If you have a stable, domain-specific list of topics (e.g. 3-5 named topics) and the processing is really different, I would also look at filtering by topic and saving the results as different DStreams in your code.
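
A sketch of that filtering variant, reusing the hypothetical `tagged` (topic, value) stream from the snippet above, with made-up topic names:

// Split the tagged stream into domain-specific DStreams; each one
// can then get genuinely different processing.
val clicks      = tagged.filter { case (topic, _) => topic == "clicks" }.map(_._2)
val impressions = tagged.filter { case (topic, _) => topic == "impressions" }.map(_._2)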

Either way you need to start with Cody's tip in order to extract the topic name.

-adrian

From: Cody Koeninger
Date: Thursday, October 1, 2015 at 5:06 PM
To: Udit Mehta
Cc: user
Subject: Re: Kafka Direct Stream

You can get the topic for a given partition from the offset range. You can either filter using that, or just have a single RDD and match on the topic when doing mapPartitions or foreachPartition (which I think is a better idea).
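
A minimal sketch of the foreachPartition variant, mirroring the pattern in the integration guide linked below; `stream` is assumed to be the DStream from KafkaUtils.createDirectStream, and the topic names and handling are placeholders:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Must run on the stream straight out of createDirectStream, before any
// shuffle, so RDD partitions still line up 1:1 with offsetRanges.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val topic = offsetRanges(TaskContext.get.partitionId).topic
    topic match {
      case "clicks"      => iter.foreach(r => println(s"click: $r"))       // placeholder
      case "impressions" => iter.foreach(r => println(s"impression: $r"))  // placeholder
      case _             => // unknown topic: skip
    }
  }
}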

http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers




On Wed, Sep 30, 2015 at 5:02 PM, Udit Mehta <umehta@groupon.com> wrote:
Hi,

I am using the Spark Kafka direct stream to consume from multiple topics. I am able to consume fine, but I am stuck on how to separate the data for each topic, since I need to process the data differently depending on the topic.
I basically want to split the RDD consisting of N topics into N RDDs, each having 1 topic.

Any help would be appreciated.

Thanks in advance,
Udit

