There is no version named just 2.4 :). Can you please specify the exact Spark version you are using? How are you starting the Spark session, and what is the environment?

I know this issue occurs intermittently during large writes to S3 and is related to S3's eventual consistency. Simply restarting the job sometimes helps.

I have a 2.7 TB DataFrame (Parquet) that I need to partition by date. However, the Spark program below doesn't work: it keeps failing with a "file already exists" exception.

df = spark.read.parquet(INPUT_PATH)

I did notice that a couple of tasks failed, which is probably why Spark spun up new task attempts that write to the same .staging directory?
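For reference, here is a minimal sketch of what a date-partitioned write could look like, with speculative execution turned off so that Spark does not launch duplicate attempts of the same task. This is only an assumption about the setup, not a confirmed fix: the output path, the `date` column name, the `overwrite` mode, and the helper names `write_conf`, `build_session`, and `write_partitioned_by_date` are all hypothetical and do not come from the original program.

```python
# Sketch only: helper names, the "date" column, and the write mode are
# illustrative assumptions, not taken from the original program.

def write_conf():
    """Spark settings to apply before a large partitioned write."""
    return {
        # Speculative execution starts duplicate attempts of slow tasks,
        # which can race on the same output files.
        "spark.speculation": "false",
    }


def build_session(app_name="partition-by-date"):
    """Create a SparkSession with the settings from write_conf()."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in write_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def write_partitioned_by_date(df, output_path, partition_col="date", mode="overwrite"):
    """Write df as Parquet with one directory per value of partition_col."""
    (
        df.write
          .partitionBy(partition_col)  # produces .../date=<value>/ directories
          .mode(mode)
          .parquet(output_path)
    )
```

Usage would be along the lines of `spark = build_session()`, then `df = spark.read.parquet(INPUT_PATH)`, then `write_partitioned_by_date(df, OUTPUT_PATH)`, where `OUTPUT_PATH` is a placeholder. Retries of genuinely failed tasks still happen with this configuration, so it would only help if speculation is what creates the conflicting attempts.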


