Hi Rishi,

There is no version called just 2.4 :). Can you please specify the exact Spark version you are using? How are you starting the Spark session, and what environment are you running in?
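
If it helps, something like the following would collect what I am asking for. It is only a sketch: describe_session is a name I made up, it assumes `spark` is your active SparkSession, and the configuration keys listed are just the ones I would look at first for a large S3 write.

```python
def describe_session(spark):
    """Collect the basic details worth pasting into a reply.

    Hypothetical helper: `spark` is assumed to be an active SparkSession.
    It only reads attributes that PySpark exposes publicly.
    """
    conf = spark.sparkContext.getConf()
    # Keys chosen for illustration; extend the list as needed.
    keys = [
        "spark.submit.deployMode",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version",
        "spark.speculation",
    ]
    return {
        "version": spark.version,              # full version string
        "master": spark.sparkContext.master,   # e.g. yarn or local[*]
        "conf": {k: conf.get(k, "<not set>") for k in keys},
    }
```

Calling print(describe_session(spark)) in the same session that runs the write and pasting the result here would be enough.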

I know this issue occurs intermittently during large writes to S3 and is related to S3's eventual consistency. Simply restarting the job sometimes helps.
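
If you want to automate the restart, a rough sketch is below. run_with_retries is a hypothetical helper, not anything Spark provides, and the attempt count and wait time are arbitrary defaults. It catches every exception, which is cruder than you would want in production, and it does not fix the underlying cause.

```python
import time


def run_with_retries(job, attempts=3, wait_seconds=60):
    """Call job() until it succeeds or the attempts run out.

    Hypothetical helper: `job` is any zero-argument callable,
    for example a lambda wrapping the DataFrame write.
    """
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            print(f"attempt {attempt} failed: {exc!r}; retrying")
            time.sleep(wait_seconds)  # wait before trying again
```

With your code, the call would look like run_with_retries(lambda: df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)). Since the write uses overwrite mode, each retry should start the output again rather than append to a partial result.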


Regards,
Gourav Sengupta

On Thu, Aug 1, 2019 at 3:55 AM Rishi Shah <rishishah.star@gmail.com> wrote:
Hi All,

I have a 2.7 TB DataFrame (Parquet) that I need to partition by date. However, the Spark program below doesn't work: it keeps failing with a "file already exists" exception.

df = spark.read.parquet(INPUT_PATH)
df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)

I did notice that a couple of tasks failed. Is that probably why it tried spinning up new ones, which then write to the same .staging directory?

--
Regards,

Rishi Shah