spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: Cartesian join on RDDs taking too much time
Date Wed, 25 May 2016 08:02:45 GMT
Hi,

It seems you'd be better off using DataFrame#join instead of RDD.cartesian,
because cartesian always needs shuffle operations, which carry a lot of
overhead such as reflection and serialization.
In your case, since the smaller table is 7 MB, DataFrame#join will use a
broadcast strategy, which is more efficient than RDD.cartesian.
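A minimal sketch of the idea in Spark 1.6-era Scala. The DataFrame names, paths, and the join column are made up for illustration; only `broadcast` and `join` are real API:

```scala
// Assumes an existing SQLContext `sqlContext`; table/column names are illustrative.
import org.apache.spark.sql.functions.broadcast

val big   = sqlContext.read.parquet("/data/financial_data")   // ~30 MB side
val small = sqlContext.read.parquet("/data/financial_codes")  // ~7 MB side

// Equi-join: when the small side is under spark.sql.autoBroadcastJoinThreshold
// (10 MB by default), Spark picks a broadcast hash join on its own, so the big
// side is never shuffled. The broadcast() hint forces that choice explicitly.
val joined = big.join(broadcast(small), big("code") === small("code"))

// Contrast: RDD.cartesian pairs every partition of A with every partition of B,
// materializing |A| x |B| rows with heavy serialization overhead.
// val pairs = rddA.cartesian(rddB)
```

Note that a true cross join (every row with every row, as in the SELECT below) has no join key, so a broadcast only helps when your query can be expressed as an equi-join.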

// maropu

On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> It is basically a Cartesian join, as in an RDBMS.
>
> Example:
>
> SELECT * FROM FinancialCodes,  FinancialData
>
> The result of this query matches every row in the FinancialCodes table
> with every row in the FinancialData table.  Each row consists of all
> columns from the FinancialCodes table followed by all columns from the
> FinancialData table.
>
>
> Not very useful
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 25 May 2016 at 08:05, Priya Ch <learnings.chitturi@gmail.com> wrote:
>
>> Hi All,
>>
>>   I have two RDDs A and B, where A is 30 MB and B is 7 MB.
>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>> cartesian operation?
>>
>> I am using Spark version 1.6.0.
>>
>> Regards,
>> Padma Ch
>>
>
>


-- 
---
Takeshi Yamamuro
