spark-user mailing list archives

From Rishi Shah <rishishah.s...@gmail.com>
Subject [Pyspark 2.4] not able to partition the data frame by dates
Date Thu, 01 Aug 2019 02:55:04 GMT
Hi All,

I have a 2.7 TB DataFrame (Parquet) that I need to partition by date.
However, the Spark program below doesn't work: it keeps failing with a
*file already exists* exception.

df = spark.read.parquet(INPUT_PATH)
df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)

I noticed that a couple of tasks failed, which is probably why Spark spun up
new attempts. Could those retries be writing to the same .staging directory?

-- 
Regards,

Rishi Shah
