spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: [Pyspark 2.4] not able to partition the data frame by dates
Date Thu, 01 Aug 2019 02:58:00 GMT
Hi Rishi,

there is no version called just 2.4 :) can you please specify the exact
Spark version you are using (2.4.0, 2.4.3, etc.)? How are you starting the
Spark session? And what is the environment?

I know this issue occurs intermittently with large writes to S3 and has to
do with S3's eventual-consistency behaviour. Just restarting the job
sometimes helps.
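
As a rough sketch (the config keys are standard Spark/Hadoop settings, but
whether they apply depends on your exact version and committer; the job file
name is a placeholder), settings along these lines are often tried when task
retries collide with files already written to the staging directory on S3:

```shell
# Disable speculative execution so duplicate task attempts do not race
# to write the same output file, and only overwrite the partitions being
# written (dynamic partition overwrite, available since Spark 2.3).
spark-submit \
  --conf spark.speculation=false \
  --conf spark.sql.sources.partitionOverwriteMode=dynamic \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your_job.py   # placeholder for the actual PySpark script
```

The file-output-committer algorithm version 2 commits task output directly
rather than via a rename of the whole job directory, which tends to behave
better on object stores like S3, though it is not a fix for eventual
consistency itself.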


Regards,
Gourav Sengupta

On Thu, Aug 1, 2019 at 3:55 AM Rishi Shah <rishishah.star@gmail.com> wrote:

> Hi All,
>
> I have a dataframe of size 2.7T (parquet) which I need to partition by
> date; however, the Spark program below keeps failing with a *file
> already exists* exception:
>
> df = spark.read.parquet(INPUT_PATH)
>
> df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)
>
> I did notice that a couple of tasks failed; presumably that is why Spark
> spun up new task attempts, which then wrote to the same .staging directory?
>
> --
> Regards,
>
> Rishi Shah
>
