spark-user mailing list archives

From "Apostolos N. Papadopoulos" <papad...@csd.auth.gr>
Subject Re: writing a small csv to HDFS is super slow
Date Fri, 22 Mar 2019 21:54:04 GMT
Is it also slow when you do not repartition (i.e., when you let Spark 
write multiple output files)?

Did you also try a plain saveAsTextFile?

And how many partitions are there before the repartition?
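
One way to answer these three questions in a single run is to time each stage separately. The following is only a rough sketch, not a tested recipe for this job. It assumes Spark 2.x or later (so the built-in csv writer can stand in for 'com.databricks.spark.csv'), that df is the DataFrame from the original message, and that base_path is a scratch directory on HDFS. The helper names timed and diagnose_slow_write, the sub-directory names, and the use of coalesce(1) in place of repartition(1) are all made up for illustration.

```python
import time


def timed(label, action):
    """Run a zero-argument callable and report its wall-clock time."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return result, elapsed


def diagnose_slow_write(df, base_path):
    """Time each suspect stage of the write for a PySpark DataFrame.

    base_path is a hypothetical scratch directory (e.g. on HDFS).
    Returns the partition count and a dict of timings in seconds.
    """
    timings = {}

    # How many partitions are there before any repartition?
    num_partitions = df.rdd.getNumPartitions()
    print(f"partitions before repartition: {num_partitions}")

    # Force the upstream computation on its own, with no output involved.
    # If this step is already slow, the csv write is not the culprit.
    cached = df.cache()
    _, timings["compute"] = timed("compute (count)", cached.count)

    # Write without repartitioning: one part file per partition.
    _, timings["write_multi"] = timed(
        "csv write, no repartition",
        lambda: cached.write.mode("overwrite")
        .option("header", "true")
        .csv(base_path + "/multi"),
    )

    # Single output file. coalesce(1) avoids the shuffle that
    # repartition(1) performs (assumption: a shuffle is not wanted here).
    _, timings["write_single"] = timed(
        "csv write, coalesce(1)",
        lambda: cached.coalesce(1)
        .write.mode("overwrite")
        .option("header", "true")
        .csv(base_path + "/single"),
    )

    # Plain saveAsTextFile on the underlying RDD. Note that this call
    # fails if the target directory already exists.
    _, timings["save_text"] = timed(
        "saveAsTextFile",
        lambda: cached.rdd.map(lambda row: ",".join(str(c) for c in row))
        .saveAsTextFile(base_path + "/text"),
    )

    return num_partitions, timings
```

Called as diagnose_slow_write(df, "/tmp/csv_debug") (the path is just a placeholder), it prints one timing per stage. If the "compute" line already accounts for most of the hour, the time is going into producing df rather than into writing the 150kb file.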

a.


On 22/3/19 23:34, Lian Jiang wrote:
> Hi,
>
> Writing a csv to HDFS takes about 1 hour:
>
> df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
>
> The generated csv file is only about 150kb. The job uses 3 containers 
> (13 cores, 23g mem).
>
> Other people have similar issues but I don't see a good explanation 
> and solution.
>
> Any clue is highly appreciated! Thanks.
>
>
-- 
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papadopo@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol



