spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephen Boesch <java...@gmail.com>
Subject Re: Hash Partitioning and Dataframes
Date Sat, 18 Jul 2015 04:22:29 GMT
Has there been any further work (or at least consideration) on joins
working on local data - via copartitioning?

2015-05-09 11:39 GMT-07:00 Michael Armbrust <michael@databricks.com>:

> Ah, unfortunately that is not possible today as Catalyst has a logical
> notion of partitioning that is different than that exposed by the RDD.
>
> A researcher at Databricks is considering allowing this kind of
> optimization for in memory cached relations though.  Here is a WIP patch:
>
> https://github.com/yhuai/spark/commit/aa4448ac801f833f7c8217fdce30cba9c803a877
>
> On Fri, May 8, 2015 at 4:06 PM, Daniel, Ronald (ELS-SDG) <
> R.Daniel@elsevier.com> wrote:
>
>>  Just trying to make sure that something I know in advance (the joins
>> will always have an equality test on one specific field) is used to
>> optimize the partitioning so the joins only use local data.
>>
>>
>>
>> Thanks for the info.
>>
>>
>>
>> Ron
>>
>>
>>
>>
>>
>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>> *Sent:* Friday, May 08, 2015 3:15 PM
>> *To:* Daniel, Ronald (ELS-SDG)
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Hash Partitioning and Dataframes
>>
>>
>>
>> What are you trying to accomplish?  Internally Spark SQL will add
>> Exchange operators to make sure that data is partitioned correctly for
>> joins and aggregations.  If you are going to do other RDD operations on the
>> result of dataframe operations and you need to manually control the
>> partitioning, call df.rdd and partition as you normally would.
>>
>>
>>
>> On Fri, May 8, 2015 at 2:47 PM, Daniel, Ronald (ELS-SDG) <
>> R.Daniel@elsevier.com> wrote:
>>
>> Hi,
>>
>> How can I ensure that a batch of DataFrames I make are all partitioned
>> based on the value of one column common to them all?
>> For RDDs I would partitionBy a HashPartitioner, but I don't see that in
>> the DataFrame API.
>> If I partition the RDDs that way, then do a toDF(), will the partitioning
>> be preserved?
>>
>> Thanks,
>> Ron
>>
>>
>>
>
>

Mime
View raw message