spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cheng, Hao" <>
Subject RE: Spark DataFrames uses too many partition
Date Wed, 12 Aug 2015 01:19:01 GMT
That's a good question, we don't support reading small files in a single partition yet, but
it's definitely an issue we need to optimize, do you mind to create a jira issue for this?
Hopefully we can merge that in 1.6 release.

200 is the default partition number for parallel tasks after the data shuffle, and we have
to change that value according to the file size, cluster size etc..

Ideally, this number would be set dynamically and automatically, however, spark sql doesn't
support the complex cost based model yet, particularly for the multi-stages job. (


-----Original Message-----
From: Al M [] 
Sent: Tuesday, August 11, 2015 11:31 PM
Subject: Spark DataFrames uses too many partition

I am using DataFrames with Spark 1.4.1.  I really like DataFrames but the partitioning makes
no sense to me.

I am loading lots of very small files and joining them together.  Every file is loaded by
Spark with just one partition.  Each time I join two small files the partition count increases
to 200.  This makes my application take 10x as long as if I coalesce everything to 1 partition
after each join.

With normal RDDs it would not expand out the partitions to 200 after joining two files with
one partition each.  It would either keep it at one or expand it to two.

Why do DataFrames expand out the partitions so much?

View this message in context:
Sent from the Apache Spark User List mailing list archive at

To unsubscribe, e-mail: For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message