spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shyam P <shyamabigd...@gmail.com>
Subject Re: spark df.write.partitionBy run very slow
Date Tue, 05 Mar 2019 07:05:32 GMT
Hi JF ,
 Try executing this before df.write...

// count rows per Spark partition
        import org.apache.spark.sql.functions.spark_partition_id
        df.groupBy(spark_partition_id()).count().show()

This will show you how the data is currently partitioned inside the df.

A small trick we can apply here with partitionBy(column_a, column_b,
column_c): make sure that (column_a partitions) > (column_b partitions) >
(column_c partitions).
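
If you want to verify that ordering, here is a minimal sketch (assuming the
same column_a/column_b/column_c names from your partitionBy call; adjust to
your actual schema):

// count distinct values of each partition column, so partitionBy can be
// ordered from the column with the most distinct values to the fewest
        import org.apache.spark.sql.functions.countDistinct
        df.select(
          countDistinct("column_a").as("distinct_column_a"),
          countDistinct("column_b").as("distinct_column_b"),
          countDistinct("column_c").as("distinct_column_c")
        ).show()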

Try this.
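
If the repartition-before-write approach from the article you linked is
still slow because of skew, this is roughly what it looks like (a sketch
only, reusing the column names and output_path from your message; it does
not remove the skew itself):

// repartition on the partitionBy columns so each task writes fewer files;
// heavily skewed keys will still produce a few slow tasks
        import org.apache.spark.sql.functions.col
        df.repartition(col("column_a"), col("column_b"), col("column_c"))
          .write
          .partitionBy("column_a", "column_b", "column_c")
          .parquet(output_path)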

Regards,
Shyam

On Mon, Mar 4, 2019 at 4:09 PM JF Chen <darouwan@gmail.com> wrote:

> I am trying to write a dataset to hdfs via df.write.partitionBy(column_a,
> column_b, column_c).parquet(output_path).
> However, it takes several minutes to write only hundreds of MB of data to
> hdfs.
> According to this article
> <https://stackoverflow.com/questions/45269658/spark-df-write-partitionby-run-very-slow>,
> adding a repartition before the write should help. But if there is data
> skew, some tasks take much longer than average, so the write is still slow.
> How can I solve this problem? Thanks in advance!
>
>
> Regards,
> Junfeng Chen
>
