spark-user mailing list archives

From: David Morin <morin.david....@gmail.com>
Subject: Re: Spark 3.0.1 Structured streaming - checkpoints fail
Date: Wed, 23 Dec 2020 21:15:50 GMT
Does it work with the standard AWS S3 solution and its new strong consistency
model
<https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/>?
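For reference, a minimal sketch of the kind of setup this question is about: pointing a Structured Streaming checkpoint at S3 through the s3a connector. The helper name and bucket/prefix values are illustrative assumptions, not something from this thread; `fs.s3a.impl` is the standard Hadoop S3A filesystem setting.

```python
# Sketch: build an s3a:// checkpoint location plus the matching Spark conf
# entry. Helper name and values are illustrative assumptions.
def s3a_checkpoint_conf(bucket: str, prefix: str):
    checkpoint_location = f"s3a://{bucket}/{prefix}"
    conf = {
        # Route s3a:// URIs through the Hadoop S3A connector.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    return checkpoint_location, conf

location, conf = s3a_checkpoint_conf("my-bucket", "checkpoints/query6")
# In the PySpark job, the conf entries would go on
# SparkSession.builder.config(...), and the location on
# df.writeStream.option("checkpointLocation", location).
```

With S3's strong read-after-write consistency (announced December 2020, linked above), every executor reading the checkpoint path sees the latest write, which is the property the rest of this thread discusses.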

On Wed, Dec 23, 2020 at 18:48, David Morin <morin.david.bzh@gmail.com>
wrote:

> Thanks.
> My Spark applications run on Docker-based nodes, but in standalone mode (1
> driver, n workers).
> Can we use S3 directly with a consistency add-on such as S3Guard (s3a) or AWS
> Consistent View
> <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html>?
>
> On Wed, Dec 23, 2020 at 17:48, Lalwani, Jayesh <jlalwani@amazon.com>
> wrote:
>
>> Yes. It is necessary to have a distributed file system, because all the
>> workers need to read and write the checkpoint. The distributed file system
>> also has to be immediately consistent: when one node writes to it, the
>> other nodes should be able to read it immediately.
>>
>> The solutions/workarounds depend on where you are hosting your Spark
>> application.
>>
>>
>>
>> *From: *David Morin <morin.david.bzh@gmail.com>
>> *Date: *Wednesday, December 23, 2020 at 11:08 AM
>> *To: *"user@spark.apache.org" <user@spark.apache.org>
>> *Subject: *[EXTERNAL] Spark 3.0.1 Structured streaming - checkpoints fail
>>
>>
>>
>> Hello,
>>
>>
>>
>> I have an issue with my PySpark job related to checkpointing.
>>
>>
>>
>> Caused by: org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 3 in stage 16997.0 failed 4 times, most recent failure: Lost
>> task 3.3 in stage 16997.0 (TID 206609, 10.XXX, executor 4):
>> java.lang.IllegalStateException: Error reading delta file
>> file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta of
>> HDFSStateStoreProvider[id = (op=0,part=3),dir =
>> file:/opt/spark/workdir/query6/checkpointlocation/state/0/3]: *file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta
>> does not exist*
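For context, the failing path follows the state store layout `<checkpointLocation>/state/<operatorId>/<partitionId>/<version>.delta`, which can be sketched as below. The helper name is an illustrative assumption, not a Spark API.

```python
# Sketch of the HDFSStateStoreProvider file layout visible in the stack
# trace above. Helper name is illustrative, not a Spark API.
def state_delta_path(checkpoint_location: str, operator_id: int,
                     partition: int, version: int) -> str:
    return f"{checkpoint_location}/state/{operator_id}/{partition}/{version}.delta"

# Reproduces the failing file from the error: op=0, part=3, version 1.
path = state_delta_path("file:/opt/spark/workdir/query6/checkpointlocation",
                        0, 3, 1)
# path == "file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta"
```

With a local `file:` checkpoint location, each executor writes state under its own container's filesystem, so a task retried on a different executor cannot find the delta file, hence the "does not exist" failure.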
>>
>>
>>
>> This job is based on Spark 3.0.1 and Structured Streaming.
>>
>> This Spark cluster (1 driver and 6 executors) runs without HDFS, and we
>> would rather not manage an HDFS cluster if possible.
>>
>> Is it necessary to have a distributed filesystem? What are the possible
>> solutions/workarounds?
>>
>>
>>
>> Thanks in advance
>>
>> David
>>
>
