spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Control number of parquet generated from JavaSchemaRDD
Date Wed, 26 Nov 2014 01:55:10 GMT
I believe coalesce(n, true) and repartition(n) are the same.  If the input
files are of similar sizes, then coalesce(n, false) will be cheaper, as it
introduces a narrow dependency
<https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf>,
meaning there won't be a shuffle.  However, if there is a lot of skew in
the input file sizes, then a repartition will ensure that the data is
shuffled evenly.
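
To make the trade-off concrete, here is a toy model in plain Java. It does
not call Spark, and the class and method names are made up for illustration.
It assumes that coalesce(n, false) merges whole input partitions into n
groups without moving records between groups, and that repartition(n)
spreads records evenly; Spark's real grouping also takes data locality into
account, so treat this only as a sketch of why skew survives a shuffle-free
coalesce.

```java
import java.util.Arrays;

/** Toy model of partition sizes (record counts); hypothetical, not Spark code. */
final class PartitionSkewSketch {

    /**
     * Models coalesce(n, false): whole input partitions are merged into n
     * contiguous groups, so a large input partition stays large.
     */
    static long[] coalesceWithoutShuffle(long[] inputSizes, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // Without a shuffle the partition count cannot grow.
        if (n >= inputSizes.length) {
            return inputSizes.clone();
        }
        long[] out = new long[n];
        for (int i = 0; i < inputSizes.length; i++) {
            // Contiguous grouping: input partition i lands in group i * n / length.
            int group = (int) ((long) i * n / inputSizes.length);
            out[group] += inputSizes[i];
        }
        return out;
    }

    /** Models repartition(n): records are dealt out evenly across n partitions. */
    static long[] repartition(long[] inputSizes, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        long total = 0;
        for (long size : inputSizes) {
            total += size;
        }
        long[] out = new long[n];
        Arrays.fill(out, total / n);
        // Hand the remainder out one record at a time.
        for (int i = 0; i < total % n; i++) {
            out[i]++;
        }
        return out;
    }
}
```

With input sizes {1000, 10, 10, 10} and n = 2, the model gives [1010, 20]
for the shuffle-free coalesce and [515, 515] for repartition.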

There is currently no way to control the file size other than picking a
'good' number of partitions.
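
One rough way to pick that number is to divide the total data size by the
file size you are aiming for. The helper below is a minimal sketch in plain
Java; the class name, method name, and the 128 MB target in the usage note
are assumptions for illustration, not part of Spark. Parquet output is
encoded and compressed, so the resulting files will usually be smaller than
this estimate suggests.

```java
/** Hypothetical helper for estimating a partition count; not part of Spark. */
final class ParquetPartitionSizing {

    /**
     * Returns the number of partitions needed so that each one holds at most
     * targetFileBytes of data, assuming the data is spread evenly.
     */
    static int partitionsFor(long totalBytes, long targetFileBytes) {
        if (totalBytes < 0 || targetFileBytes <= 0) {
            throw new IllegalArgumentException(
                "totalBytes must be >= 0 and targetFileBytes must be > 0");
        }
        // Ceiling division without overflow, so a partial last file still counts.
        long n = totalBytes / targetFileBytes
            + (totalBytes % targetFileBytes == 0 ? 0 : 1);
        // Always keep at least one partition and stay within int range.
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }
}
```

For example, partitionsFor(10L * 1024 * 1024 * 1024, 128L * 1024 * 1024)
returns 80 for 10 GB of data and a 128 MB target. The result would then be
passed as n to repartition(n) or coalesce(n, ...) before writing the Parquet
output.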

On Tue, Nov 25, 2014 at 11:30 AM, tridib <tridib.samanta@live.com> wrote:

> Thanks Michael,
> It worked like a charm! I have a few more questions:
> 1. Is there a way to control the size of the Parquet files?
> 2. Which method do you recommend: coalesce(n, true), coalesce(n, false), or
> repartition(n)?
>
> Thanks & Regards
> Tridib
