spark-user mailing list archives

From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: Spark 3.0.1 Structured streaming - checkpoints fail
Date Wed, 23 Dec 2020 22:29:24 GMT
Please refer to my previous answer:
https://lists.apache.org/thread.html/r7dfc9e47cd9651fb974f97dde756013fd0b90e49d4f6382d7a3d68f7%40%3Cuser.spark.apache.org%3E
We should probably add this to the Structured Streaming guide. We didn't need
to before, because checkpointing simply didn't work with the eventually
consistent model; now it works, but it is very inefficient.


On Thu, Dec 24, 2020 at 6:16 AM David Morin <morin.david.bzh@gmail.com>
wrote:

> Does it work with the standard AWS S3 solution and its new
> consistency model
> <https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/>
> ?
>
> On Wed, Dec 23, 2020 at 6:48 PM David Morin <morin.david.bzh@gmail.com>
> wrote:
>
>> Thanks.
>> My Spark applications run on nodes based on Docker images, but in
>> standalone mode (1 driver, n workers).
>> Can we use S3 directly with a consistency add-on such as S3Guard (s3a) or
>> AWS Consistent View
>> <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html>?
>>
>> On Wed, Dec 23, 2020 at 5:48 PM Lalwani, Jayesh <jlalwani@amazon.com>
>> wrote:
>>
>>> Yes. It is necessary to have a distributed file system because all the
>>> workers need to read from and write to the checkpoint. The distributed
>>> file system has to be immediately consistent: when one node writes to it,
>>> the other nodes should be able to read that write immediately.
>>>
>>> The solutions/workarounds depend on where you are hosting your Spark
>>> application.
>>>
>>>
>>>
>>> *From: *David Morin <morin.david.bzh@gmail.com>
>>> *Date: *Wednesday, December 23, 2020 at 11:08 AM
>>> *To: *"user@spark.apache.org" <user@spark.apache.org>
>>> *Subject: *[EXTERNAL] Spark 3.0.1 Structured streaming - checkpoints
>>> fail
>>>
>>> Hello,
>>>
>>>
>>>
>>> I have a checkpoint-related issue with my PySpark job.
>>>
>>>
>>>
>>> Caused by: org.apache.spark.SparkException: Job aborted due to stage
>>> failure: Task 3 in stage 16997.0 failed 4 times, most recent failure: Lost
>>> task 3.3 in stage 16997.0 (TID 206609, 10.XXX, executor 4):
>>> java.lang.IllegalStateException: Error reading delta file
>>> file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta of
>>> HDFSStateStoreProvider[id = (op=0,part=3),dir =
>>> file:/opt/spark/workdir/query6/checkpointlocation/state/0/3]: *file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta
>>> does not exist*
>>>
>>>
>>>
>>> This job is based on Spark 3.0.1 and Structured Streaming.
>>>
>>> This Spark cluster (1 driver and 6 executors) runs without HDFS, and we
>>> don't want to manage an HDFS cluster if possible.
>>>
>>> Is it necessary to have a distributed filesystem? What are the
>>> different solutions/workarounds?
>>>
>>>
>>>
>>> Thanks in advance
>>>
>>> David
>>>
>>
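
To illustrate the point made in the thread that every worker must read and
write the same checkpoint, here is a minimal Python sketch that flags
checkpoint locations resolving to a node-local filesystem, such as the
file:/opt/spark/workdir/query6/checkpointlocation path in the stack trace.
The helper names (is_node_local, require_shared_checkpoint), the example
bucket name, and the set of schemes treated as local are assumptions made
for illustration; none of them are part of Spark. A file: path on a mount
shared by all nodes (for example NFS) would also be flagged here, even
though it might work in practice.

```python
from urllib.parse import urlparse

# Assumption: these schemes are private to each node. A path with no scheme is
# resolved by Hadoop against fs.defaultFS, which is the local filesystem unless
# the cluster is configured otherwise.
NODE_LOCAL_SCHEMES = {"", "file"}


def is_node_local(checkpoint_location: str) -> bool:
    """Return True if the location's URI scheme is in NODE_LOCAL_SCHEMES."""
    return urlparse(checkpoint_location).scheme.lower() in NODE_LOCAL_SCHEMES


def require_shared_checkpoint(checkpoint_location: str) -> str:
    """Return the location unchanged, or raise if it looks node-local."""
    if is_node_local(checkpoint_location):
        raise ValueError(
            "checkpoint location %r is node-local; executors on other nodes "
            "will not see state files written there" % checkpoint_location
        )
    return checkpoint_location


if __name__ == "__main__":
    # Path taken from the stack trace in the original message.
    print(is_node_local("file:/opt/spark/workdir/query6/checkpointlocation"))
    # Hypothetical bucket name, used only as an example of a shared location.
    print(is_node_local("s3a://example-bucket/checkpoints/query6"))
```

The validated value would then be passed to the streaming query through
.option("checkpointLocation", ...) on the DataStreamWriter.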
