spark-user mailing list archives

From "Lalwani, Jayesh" <>
Subject Re: Spark 3.0.1 Structured streaming - checkpoints fail
Date Wed, 23 Dec 2020 16:48:44 GMT
Yes. A distributed file system is necessary because all the workers need to read from and
write to the checkpoint. The file system also has to be immediately consistent: when one
node writes to it, the other nodes should be able to read that data immediately.
The solutions/workarounds depend on where you are hosting your Spark application.
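
The path in the error below starts with file:, which points at each executor's own local disk.
As a rough illustration only, the following sketch checks the URI scheme of a checkpoint
location before a query is started. The helper names and the set of "shared" schemes are
assumptions made for this example, not part of Spark, and a file: path on a mount shared by
all nodes would be rejected by this check even though every node could reach it.

```python
from urllib.parse import urlparse

# Assumption: URI schemes that usually denote storage reachable from every node.
# Adjust this set to match the storage actually available in your deployment.
SHARED_SCHEMES = {"hdfs", "s3a", "abfs", "abfss", "gs"}


def is_shared_checkpoint_location(location: str) -> bool:
    """Return True if the location uses one of the assumed shared-storage schemes."""
    scheme = urlparse(location).scheme.lower()
    return scheme in SHARED_SCHEMES


def require_shared_checkpoint_location(location: str) -> str:
    """Return the location unchanged, or raise if it looks node-local."""
    if not is_shared_checkpoint_location(location):
        raise ValueError(
            f"Checkpoint location {location!r} does not use a shared-storage scheme; "
            "each executor would read and write its own local directory."
        )
    return location
```

The validated value could then be passed to the streaming writer, for example
df.writeStream.option("checkpointLocation", require_shared_checkpoint_location(path)),
where df and path are placeholders for your own streaming DataFrame and checkpoint URI.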

From: David Morin <>
Date: Wednesday, December 23, 2020 at 11:08 AM
To: "" <>
Subject: [EXTERNAL] Spark 3.0.1 Structured streaming - checkpoints fail

I have an issue with my PySpark job related to checkpointing.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage
16997.0 failed 4 times, most recent failure: Lost task 3.3 in stage 16997.0 (TID 206609, 10.XXX,
executor 4): java.lang.IllegalStateException: Error reading delta file file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/
of HDFSStateStoreProvider[id = (op=0,part=3),dir = file:/opt/spark/workdir/query6/checkpointlocation/state/0/3]:
file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/ does not exist

This job is based on Spark 3.0.1 and Structured Streaming.
This Spark cluster (1 driver and 6 executors) runs without HDFS, and we don't want to manage
an HDFS cluster if possible.
Is it necessary to have a distributed filesystem? What are the different solutions/workarounds?

Thanks in advance