spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Thúy Hằng Lê <thuyhang...@gmail.com>
Subject Spark Streaming data checkpoint performance
Date Mon, 02 Nov 2015 05:20:46 GMT
Hi Spark guru

I am evaluating Spark Streaming,

In my application I need to maintain cumulative statistics (e.g the total
running word count), so I need to call the updateStateByKey function on
very micro-batch.

After setting those things, I got following behaviors:
        * The Processing Time is very high every 10 seconds - usually 5x
higher (which I guess it's data checking point job)
        * The Processing Time becomes higher and higher over time, after 10
minutes it's much higher than the batch interval and lead to huge
Scheduling Delay and a lots Active Batches in queue.

My questions is:

         * Is this expected behavior? Is there any way to improve the
performance of data checking point?
         * How data checking point in Spark Streaming works? Does it need
to load all previous checking point data in order to build new one?

My job is very simple:

JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.seconds(2));
JavaPairInputDStream<String, String> messages =
KafkaUtils.createDirectStream(...);

JavaPairDStream<String, List<Double>> stats = messages.mapToPair(parseJson)
.reduceByKey(REDUCE_STATS) .updateStateByKey(RUNNING_STATS);

stats.print()

Mime
View raw message