spark-user mailing list archives

From "Kelvin Qin" <>
Subject Re:[Structured Streaming] Checkpoint file compact file grows big
Date Thu, 16 Apr 2020 01:07:54 GMT

"Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause
an increase in the processing time of those batches where RDDs get checkpointed."

As far as I know, the official documentation states that the checkpoint data of a Spark Streaming
application keeps growing over time, and that data (RDD) checkpointing is necessary even for
basic functioning when stateful transformations are used.
So, for applications that require long-term aggregation, I use a third-party cache such as Redis
in production. You could also try Alluxio.


At 2020-04-16 08:19:24, "Ahn, Daniel" <> wrote:

Are Spark Structured Streaming checkpoint files expected to grow over time indefinitely? Is
there a recommended way to safely age-off old checkpoint data?


Currently we have a Spark Structured Streaming process that reads from Kafka and writes to an
HDFS sink, with checkpointing enabled and the checkpoint location on HDFS. This streaming application
has been running for 4 months. Over time we have noticed that at every 10th job within
the application there is a delay of about 5 minutes between when that job finishes and the next job
starts, which we have attributed to the checkpoint compaction process. The .compact
file that is written is now about 2 GB in size, and its contents show that it still tracks
files processed at the very start of the streaming application.
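
To illustrate why a .compact file of this kind keeps growing, here is a minimal sketch in Python of a compacting log. It is a simplified model, not Spark's actual implementation: the function name `simulate_compact_log`, the fixed `files_per_batch`, and the compaction interval of 10 (chosen only to match the every-10th-job pattern described above) are assumptions for illustration.

```python
def simulate_compact_log(num_batches, files_per_batch, compact_interval=10):
    """Return {batch_id: number of entries written for that batch}.

    Simplified model (assumption): every batch logs the files it wrote,
    and every `compact_interval`-th batch rewrites ALL entries seen so far
    into a single compact file, so nothing is ever aged off.
    """
    total_entries = 0
    written = {}
    for batch_id in range(num_batches):
        total_entries += files_per_batch
        if (batch_id + 1) % compact_interval == 0:
            # Compaction batch: the compact file holds the full history.
            written[batch_id] = total_entries
        else:
            # Ordinary batch: only this batch's entries are written.
            written[batch_id] = files_per_batch
    return written


sizes = simulate_compact_log(num_batches=30, files_per_batch=5)
compact_batches = {b: n for b, n in sizes.items() if (b + 1) % 10 == 0}
print(compact_batches)  # {9: 50, 19: 100, 29: 150}
```

In this model the compact file grows linearly with the number of batches processed, which would be consistent with a delay that gets longer the longer the application runs.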


This issue can be reproduced with any Spark Structured Streaming process that writes checkpoint


Is the best approach for handling the growth of these files to simply delete the latest .compact
file within the checkpoint directory, and are there any associated risks with doing so?

