spark-user mailing list archives

From Jean-Pascal Billaud
Subject DStream demultiplexer based on a key
Date Sun, 14 Dec 2014 18:50:45 GMT

I am doing an experiment with Spark Streaming consisting of moving data
from Kafka to S3 locations while partitioning by date. I have already
looked into LinkedIn's Camus and Pinterest's Secor, and while both are
workable solutions, it feels like Spark Streaming should be able to be on
par with them without us having to manage yet another application in our
stack, since we already have a Spark Streaming cluster in production.

So what I am trying to do is very simple really. Each message in Kafka is
thrift-serialized, and the corresponding thrift objects have a timestamp
field. What I'd like to do is something like this:

JavaPairDStream<Void, Log> stream = KafkaUtils.createRawStream(...);
JavaPairDStream<String, Log> keyed = stream.mapToPair(
    new PairFunction<Tuple2<Void, Log>, String, Log>() {
      public Tuple2<String, Log> call(Tuple2<Void, Log> tuple) {
        // re-key each record by the date carried in the thrift object
        return new Tuple2<>(tuple._2().getDate(), tuple._2());
      }
    });

At this point, I'd like to partition the resulting DStream so that I end up
with multiple DStreams, each sharing a single common date string. So for
instance one DStream would have all the entries from 12/01 and another all
the entries from 12/02. Once I have this list of DStreams, I would basically
call saveAsObjectFiles() on each of them. Unfortunately I did not find a way
to demultiplex a DStream based on a key. Obviously the reduce family of
operations does some of that, but the result is still a single DStream.
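
To make the intent concrete, here is a plain-Java sketch (no Spark) of the
grouping I am after; the Log class is just a stub standing in for the
thrift-generated type, and the date format is made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptually the demultiplex step is just a group-by on the date key:
// every record lands in the bucket for its date, and each bucket would
// then be written out to its own S3 prefix.
public class Demux {
    static class Log {
        private final String date;
        Log(String date) { this.date = date; }
        String getDate() { return date; }
    }

    static Map<String, List<Log>> bucketize(List<Log> batch) {
        Map<String, List<Log>> buckets = new HashMap<>();
        for (Log log : batch) {
            buckets.computeIfAbsent(log.getDate(), d -> new ArrayList<>()).add(log);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Log> batch = Arrays.asList(
            new Log("12/01"), new Log("12/02"), new Log("12/01"));
        Map<String, List<Log>> buckets = bucketize(batch);
        System.out.println(buckets.get("12/01").size()); // prints 2
        System.out.println(buckets.keySet().size());     // prints 2
    }
}
```

What I can't find is a DStream operation that produces one stream per bucket
like this map does.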

An alternative approach would be to call foreachRDD() on the DStream and
demultiplex the entries into multiple new RDDs based on the timestamp, so
that entries with the same date end up bucketized in the same RDD, and
finally call saveAsObjectFile() on each of them. I am not sure whether I can
use parallelize() to create those RDDs?
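
For the record, here is roughly what I have in mind, as an untested sketch
against the 1.x Java API: collect the distinct dates in each micro-batch,
then filter() once per date, so no parallelize() is needed. The keyed
stream name, bucket, and path layout are all assumptions:

```java
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// Assuming "keyed" is the JavaPairDStream<String, Log> keyed by date above.
keyed.foreachRDD(new Function<JavaPairRDD<String, Log>, Void>() {
  public Void call(JavaPairRDD<String, Log> rdd) {
    // find which dates are present in this batch
    List<String> dates = rdd.keys().distinct().collect();
    for (final String date : dates) {
      rdd.filter(new Function<Tuple2<String, Log>, Boolean>() {
           public Boolean call(Tuple2<String, Log> t) {
             return t._1().equals(date);
           }
         })
         .values()
         // hypothetical bucket/prefix; the timestamp keeps batches apart
         .saveAsObjectFile("s3n://my-bucket/logs/" + date + "/"
                           + System.currentTimeMillis());
    }
    return null;
  }
});
```

The obvious downside is one pass over the batch per distinct date, but with
day-level keys there should only be one or two dates per batch.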

Another thing that I am going to experiment with is a much longer batching
interval. I am talking minutes, because I don't want to end up with a bunch
of tiny files. I might simply use a bigger Duration or use one of the window
operations. Not sure if anybody has tried running Spark Streaming that way?
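
Something like the following is what I mean, with the interval values made
up. A non-overlapping window (window length equal to slide interval) would
let the batch interval stay small while the writes happen every few minutes:

```java
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Batch every minute, but only materialize (and write) every 10 minutes.
JavaStreamingContext jssc =
    new JavaStreamingContext(conf, new Duration(60 * 1000));

// keyed is the date-keyed JavaPairDStream<String, Log> from earlier.
keyed.window(new Duration(10 * 60 * 1000), new Duration(10 * 60 * 1000))
     .dstream()
     .saveAsObjectFiles("s3n://my-bucket/logs/batch", "");
```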

Any thoughts on that would be much appreciated,

