spark-user mailing list archives

From Adrian Tanase <atan...@adobe.com>
Subject Re: Spark Streaming data checkpoint performance
Date Mon, 02 Nov 2015 15:37:49 GMT
You are correct: the default checkpointing interval is 10 seconds or your batch interval, whichever
is bigger. You can change it by calling .checkpoint(interval) on your resulting DStream.
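
For example, a minimal sketch in Java ("wordCounts" is a hypothetical name for the
JavaPairDStream produced by updateStateByKey; the 30-second value is just an illustration):

    import org.apache.spark.streaming.Durations;

    // Checkpoint the stateful stream every 30 seconds instead of the default.
    // wordCounts is assumed to be the DStream returned by updateStateByKey.
    wordCounts.checkpoint(Durations.seconds(30));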

For the rest, you are probably keeping an “all time” word count that grows unbounded if
you never remove words from the map. Keep in mind that updateStateByKey is called for every
key in the state RDD, regardless of whether it has new occurrences or not.

You should consider at least one of these strategies:

  *   Run your word count on a windowed DStream (e.g. unique counts over the last 15 minutes)
      *   Your best bet here is reduceByKeyAndWindow with an inverse function (see the first sketch below)
  *   Make your state object richer and prune out words with very few occurrences or that
haven’t been updated for a long time
      *   You can do this by returning None from updateStateByKey (see the second sketch below)
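
Here is a minimal sketch of the windowed approach in Java, matching your snippet below
("pairs" is a hypothetical stream of (word, 1) tuples; assumes your 2-second batch interval;
note the inverse-function variant requires checkpointing to be enabled):

    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // pairs: hypothetical JavaPairDStream<String, Integer> of (word, 1) tuples.
    // Counts over a sliding 15-minute window, updated every 2 seconds.
    // The inverse function subtracts the batch sliding out of the window, so
    // Spark updates counts incrementally instead of re-reducing the whole window.
    JavaPairDStream<String, Integer> windowedCounts = pairs.reduceByKeyAndWindow(
        (a, b) -> a + b,         // add counts entering the window
        (a, b) -> a - b,         // subtract counts leaving the window
        Durations.minutes(15),   // window length
        Durations.seconds(2));   // slide interval

And a sketch of the pruning approach, under the Spark 1.x Java API where the update
function returns Guava's Optional (Optional.absent() drops the key, the equivalent of
returning None in Scala). WordState and TTL_MS are made-up names for illustration:

    import java.io.Serializable;
    import java.util.List;
    import com.google.common.base.Optional;
    import org.apache.spark.api.java.function.Function2;

    // Hypothetical state object: running count plus the time of the last update.
    public static class WordState implements Serializable {
        public final long count;
        public final long lastUpdatedMs;
        public WordState(long count, long lastUpdatedMs) {
            this.count = count;
            this.lastUpdatedMs = lastUpdatedMs;
        }
    }

    static final long TTL_MS = 60 * 60 * 1000;  // assumed 1-hour idle threshold

    Function2<List<Integer>, Optional<WordState>, Optional<WordState>> update =
        (newValues, state) -> {
            long now = System.currentTimeMillis();
            if (newValues.isEmpty()) {
                // No new occurrences: drop the key once it has been idle too long.
                if (state.isPresent() && now - state.get().lastUpdatedMs > TTL_MS) {
                    return Optional.absent();
                }
                return state;
            }
            long sum = state.isPresent() ? state.get().count : 0L;
            for (Integer v : newValues) {
                sum += v;
            }
            return Optional.of(new WordState(sum, now));
        };

    // JavaPairDStream<String, WordState> stateCounts = pairs.updateStateByKey(update);

Either way the state stays bounded: the window discards old batches automatically, and
the TTL check evicts keys that have gone idle.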

Hope this helps,
-adrian

From: Thúy Hằng Lê
Date: Monday, November 2, 2015 at 7:20 AM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Spark Streaming data checkpoint performance

JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));