spark-user mailing list archives

From Jorge Sánchez <jorgesg1...@gmail.com>
Subject Re: dataframes and numPartitions
Date Sun, 18 Oct 2015 19:46:46 GMT
Alex,

If that property doesn't cover your case, you can try the coalesce(n) or
repartition(n) functions.

As per the API, coalesce will not trigger a shuffle (so it can only reduce
the number of partitions), but repartition performs a full shuffle and can
either increase or decrease the partition count.

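For example, a quick sketch in the spark-shell (df stands for your existing
DataFrame):

    // coalesce: narrow dependency, no shuffle; only useful for reducing
    // the number of partitions
    val fewer = df.coalesce(10)

    // repartition: full shuffle; can increase or decrease the partition count
    val more = df.repartition(400)
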
Regards.

2015-10-16 0:52 GMT+01:00 Mohammed Guller <mohammed@glassbeam.com>:

> You may find the spark.sql.shuffle.partitions property useful. The default
> value is 200.
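
For reference, a minimal sketch of setting that property from the spark-shell
(sqlContext is the usual SQLContext handle):

    // every shuffle introduced by the SQL planner will now produce 50 partitions
    sqlContext.setConf("spark.sql.shuffle.partitions", "50")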
>
> Mohammed
>
> From: Alex Nastetsky [mailto:alex.nastetsky@vervemobile.com]
> Sent: Wednesday, October 14, 2015 8:14 PM
> To: user
> Subject: dataframes and numPartitions
>
> A lot of RDD methods take a numPartitions parameter that lets you specify
> the number of partitions in the result. For example, groupByKey.
>
> The DataFrame counterparts don't have a numPartitions parameter, e.g.
> groupBy only takes a list of Columns as parameters.
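
To make the contrast concrete, a sketch (pairRDD and df are hypothetical, as
is the column name "key"):

    // RDD API: the partition count of the result is explicit
    val grouped = pairRDD.groupByKey(32)

    // DataFrame API: groupBy only accepts columns; the partition count of
    // the shuffle comes from spark.sql.shuffle.partitions instead
    val counts = df.groupBy(df("key")).count()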
>
> I understand that the DataFrame API is supposed to be smarter and go
> through a LogicalPlan, and perhaps determine the optimal number of
> partitions for you, but sometimes you want to specify the number of
> partitions yourself. One such use case is when you are preparing to do a
> "merge" join with another dataset that is similarly partitioned with the
> same number of partitions.
>
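
On the merge-join use case, a minimal sketch (dfA, dfB, and the join key "id"
are hypothetical): shuffle joins in Spark SQL size their exchanges from
spark.sql.shuffle.partitions, so that property is the closest DataFrame-level
substitute for an explicit numPartitions argument:

    // both sides of the join are shuffled into 64 partitions
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")
    val joined = dfA.join(dfB, dfA("id") === dfB("id"))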
