spark-user mailing list archives

From François Pelletier <newslett...@francoispelletier.org>
Subject Re: Ahhhh... Spark creates >30000 partitions... What can I do?
Date Tue, 20 Oct 2015 13:03:48 GMT
You should aggregate your files into larger chunks before doing anything
else. HDFS is not a good fit for small files: large numbers of them bloat
the NameNode metadata and cause a lot of performance issues. Target
partitions of a few hundred MB each, save those files back to HDFS, and
then delete the original ones. You can read the files, call coalesce, and
then use one of the saveAsXXX methods on the result.
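
A rough sketch of that read / coalesce / save sequence in PySpark. The
256 MiB chunk size, the 1.25 MiB average file size, and the function
names are assumptions for illustration only, and saveAsPickleFile is just
one possible choice among the saveAsXXX methods:

```python
def target_partitions(total_bytes: int, chunk_bytes: int = 256 * 1024 * 1024) -> int:
    """Number of output partitions so that each holds roughly chunk_bytes."""
    if total_bytes <= 0:
        return 1
    return -(-total_bytes // chunk_bytes)  # ceiling division on integers


def compact(sc, src: str, dst: str, num_partitions: int) -> None:
    """Read many small files, merge partitions without a shuffle, and save.

    `sc` is an existing SparkContext; `src` and `dst` are hypothetical paths.
    """
    (sc.binaryFiles(src)            # RDD of (path, content) pairs
       .coalesce(num_partitions)    # shuffle defaults to False
       .saveAsPickleFile(dst))      # assumed output format, pick your own


# Assumed numbers: 50000 files averaging 1.25 MiB (1_310_720 bytes) each.
estimated_total = 50_000 * 1_310_720
n = target_partitions(estimated_total)  # 245 for these assumed numbers
```

With a live SparkContext this would be called as compact(sc, src, dst, n),
and the compacted data could later be read back with sc.pickleFile(dst).
Delete the originals only after checking the new output.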

I had the same kind of problem once and solved it by bunching hundreds of
files together into larger ones. I used text files with bzip2 compression.
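
One way to decide which files go together is a simple greedy pass over
(name, size) pairs. This is only a sketch of the general idea, not
necessarily how I did it back then; the function name and the sample
sizes are made up for illustration:

```python
from typing import Iterable, List, Tuple


def batch_by_size(files: Iterable[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Group files in order so each batch stays at or under target_bytes.

    A single file larger than target_bytes still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch if adding this file would overshoot.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical file names and sizes in bytes.
files = [("a", 600), ("b", 500), ("c", 900), ("d", 100)]
batches = batch_by_size(files, target_bytes=1000)
# batches == [["a"], ["b"], ["c", "d"]]
```

Each batch can then be concatenated into one larger file and compressed.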



On 2015-10-20 08:42, Sean Owen wrote:
> coalesce without a shuffle? It shouldn't be an action. It just treats
> many partitions as one.
>
> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t3l@threelights.de> wrote:
>
>
>     I have a dataset consisting of 50000 binary files (each between
>     500 KB and 2 MB). They are stored in HDFS on a Hadoop cluster. The
>     datanodes of the cluster are also the workers for Spark. I open the
>     files as an RDD using sc.binaryFiles("hdfs:///path_to_directory").
>     When I run the first action that involves this RDD, Spark spawns an
>     RDD with more than 30000 partitions, and it takes ages to process
>     these partitions even if you simply run "count". Performing a
>     "repartition" directly after loading does not help, because Spark
>     seems to insist on materializing the RDD created by binaryFiles
>     first.
>
>     How can I get around this?

