spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From JF Chen <darou...@gmail.com>
Subject Re: spark df.write.partitionBy run very slow
Date Wed, 06 Mar 2019 03:03:00 GMT
Hi Shyam
Thanks for your reply.
You mean after knowing the partition number of column_a, column_b,
column_c, the sequence of column in partitionBy should be same to the order
of partitions number of column a, b and c?
But the sequence of columns in  partitionBy  decides the
directory  hierarchy structure. I hope the sequence of columns not change.

And I found one more strange things, some tasks in write hdfs stage cost
much more time than others, where the amount of writing data is similar.
How to solve it?

Regard,
Junfeng Chen


On Tue, Mar 5, 2019 at 3:05 PM Shyam P <shyamabigdata@gmail.com> wrote:

> Hi JF ,
>  Try to execute it before df.write....
>
> //count by partition_id
>         import org.apache.spark.sql.functions.spark_partition_id
>         df.groupBy(spark_partition_id).count.show()
>
> You will come to know how data has been partitioned inside df.
>
> Small trick we can apply here while partitionBy(column_a, column_b,
> column_c)
> Makes sure  you should have ( column_a  partitions) > ( column_b
> partitions) >  ( column_c  partitions) .
>
> Try this.
>
> Regards,
> Shyam
>
> On Mon, Mar 4, 2019 at 4:09 PM JF Chen <darouwan@gmail.com> wrote:
>
>> I am trying to write data in dataset to hdfs via df.write.partitionBy(column_a,
>> column_b, column_c).parquet(output_path)
>> However, it costs several minutes to write only hundreds of MB data to
>> hdfs.
>> From this article
>> <https://stackoverflow.com/questions/45269658/spark-df-write-partitionby-run-very-slow>,
>> adding repartition method before write should work. But if there is data
>> skew, some tasks may cost much longer time than average, which still cost
>> much time.
>> How to solve this problem? Thanks in advance !
>>
>>
>> Regard,
>> Junfeng Chen
>>
>

Mime
View raw message