spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alasdair McBride <alasdair.mcbr...@gmail.com>
Subject RE: Spark DataFrames uses too many partition
Date Wed, 12 Aug 2015 08:02:53 GMT
Thank you Hao; that was a fantastic response. I have raised SPARK-9782 for
this.

I also would love to have dynamic partitioning. I mentioned it in the Jira.
On 12 Aug 2015 02:19, "Cheng, Hao" <hao.cheng@intel.com> wrote:

> That's a good question, we don't support reading small files in a single
> partition yet, but it's definitely an issue we need to optimize, do you
> mind to create a jira issue for this? Hopefully we can merge that in 1.6
> release.
>
> 200 is the default partition number for parallel tasks after the data
> shuffle, and we have to change that value according to the file size,
> cluster size etc..
>
> Ideally, this number would be set dynamically and automatically, however,
> spark sql doesn't support the complex cost based model yet, particularly
> for the multi-stages job. (
> https://issues.apache.org/jira/browse/SPARK-4630)
>
> Hao
>
> -----Original Message-----
> From: Al M [mailto:alasdair.mcbride@gmail.com]
> Sent: Tuesday, August 11, 2015 11:31 PM
> To: user@spark.apache.org
> Subject: Spark DataFrames uses too many partition
>
> I am using DataFrames with Spark 1.4.1.  I really like DataFrames but the
> partitioning makes no sense to me.
>
> I am loading lots of very small files and joining them together.  Every
> file is loaded by Spark with just one partition.  Each time I join two
> small files the partition count increases to 200.  This makes my
> application take 10x as long as if I coalesce everything to 1 partition
> after each join.
>
> With normal RDDs it would not expand out the partitions to 200 after
> joining two files with one partition each.  It would either keep it at one
> or expand it to two.
>
> Why do DataFrames expand out the partitions so much?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrames-uses-too-many-partition-tp24214.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org For additional
> commands, e-mail: user-help@spark.apache.org
>
>

Mime
View raw message