spark-issues mailing list archives

From "Aaditya Ramesh (JIRA)" <>
Subject [jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints
Date Fri, 17 Feb 2017 20:10:42 GMT


Aaditya Ramesh commented on SPARK-19525:

[~zsxwing] Actually, we are compressing the data in the RDDs, not the streaming metadata.
We compress all records in a partition together and write them to our DFS. In our case, the
snappy-compressed size of each RDD partition is around 18 MB, with 84 partitions, for a total
of 1.5 GB per RDD.
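The approach described above (serialize all records of a partition together, compress the whole blob, write one file per partition to the DFS) can be sketched as follows. This is an illustrative sketch only, not the commenter's actual code; gzip from the standard library stands in for snappy, which lives in a third-party package (`python-snappy`), and `compress_partition` / `decompress_partition` are hypothetical helper names.

```python
import gzip
import pickle

def compress_partition(records):
    """Serialize all records in one partition together, then compress
    the single blob. gzip stands in for snappy here; the structure is
    the same: one compressed blob per partition, written as one DFS file.
    """
    return gzip.compress(pickle.dumps(list(records)))

def decompress_partition(blob):
    """Inverse of compress_partition: decompress, then deserialize."""
    return pickle.loads(gzip.decompress(blob))
```

In a Spark job this kind of transformation would typically be applied per partition (e.g. via `rdd.mapPartitions`) before the checkpoint write, so that each partition is compressed as a unit rather than record by record, which gives the compressor a larger window and better ratios.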

> Enable Compression of Spark Streaming Checkpoints
> -------------------------------------------------
>                 Key: SPARK-19525
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Aaditya Ramesh
> In our testing, compressing partitions with snappy while writing them to checkpoints
> on HDFS improved performance significantly and also reduced the variability of the
> checkpointing operation: checkpointing time was reduced by 3X, and variability by 2X,
> for data sets of compressed size approximately 1 GB.

This message was sent by Atlassian JIRA

