spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: Cartesian join on RDDs taking too much time
Date Wed, 25 May 2016 10:17:12 GMT
Why did you use RDD#saveAsTextFile instead of DataFrame#write, saving as
Parquet, ORC, ...?
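The suggestion above can be sketched as follows. This is a minimal, hypothetical example (the `joinedDF` name and the output path are illustrative, not from the thread); in Spark 1.6 the `DataFrame#write` API writes columnar formats directly:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: persist the joined DataFrame directly as Parquet
// instead of converting it to an RDD and calling saveAsTextFile.
def saveJoined(joinedDF: DataFrame): Unit = {
  // Writing Parquet keeps the columnar representation and the schema,
  // skipping the per-row text serialization that saveAsTextFile performs.
  joinedDF.write.parquet("hdfs:///user/padma/joined_parquet")
}
```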

// maropu

On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitturi@gmail.com>
wrote:

> Hi, yes, I have joined using DataFrame#join. Now, to save this into HDFS, I
> am converting the joined DataFrame to an RDD (dataframe.rdd) and trying to
> save it with saveAsTextFile. However, this is also taking too much time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin.m.s@gmail.com>
> wrote:
>
>> Hi,
>>
>> Seems you'd be better off using DataFrame#join instead of RDD.cartesian,
>> because RDD.cartesian always needs shuffle operations, which have a lot of
>> overhead such as reflection, serialization, ...
>> In your case, since the smaller table is 7 MB, DataFrame#join uses a
>> broadcast strategy.
>> This is a little more efficient than RDD.cartesian.
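The advice above can be sketched roughly as follows. The table paths and the join key are illustrative assumptions; `broadcast()` from `org.apache.spark.sql.functions` is available in Spark 1.5+ and hints the planner to ship the small side to every executor:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

// Hypothetical sketch: join two DataFrames, broadcasting the smaller one
// (paths and the "key" column are illustrative, not from the thread).
def joinWithBroadcast(sqlContext: SQLContext): Unit = {
  val big   = sqlContext.read.parquet("hdfs:///data/big")    // ~30 MB side
  val small = sqlContext.read.parquet("hdfs:///data/small")  // ~7 MB side

  // broadcast() avoids shuffling the large side. Tables below
  // spark.sql.autoBroadcastJoinThreshold (10 MB by default) may be
  // broadcast automatically even without the hint.
  val joined = big.join(broadcast(small), big("key") === small("key"))
  joined.explain() // should show a BroadcastHashJoin in the plan
}
```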
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> It is basically a Cartesian join, as in an RDBMS.
>>>
>>> Example:
>>>
>>> SELECT * FROM FinancialCodes,  FinancialData
>>>
>>> The result of this query matches every row in the FinancialCodes table
>>> with every row in the FinancialData table. Each row consists of all
>>> columns from the FinancialCodes table followed by all columns from the
>>> FinancialData table.
>>>
>>>
>>> Not very useful
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitturi@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB.
>>>> A.cartesian(B) is taking too much time. Is there a bottleneck in the
>>>> cartesian operation?
>>>>
>>>> I am using Spark 1.6.0.
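A back-of-the-envelope sketch (no Spark needed) of why the cartesian product is slow even on small inputs: its output has |A| * |B| pairs, so the cost multiplies rather than adds. The row counts below are illustrative assumptions, not figures from the thread:

```scala
// Hypothetical row counts for the two RDDs (assumed for illustration only).
val rowsA = 300000L // e.g. the 30 MB RDD
val rowsB = 70000L  // e.g. the 7 MB RDD

// A.cartesian(B) must materialize and shuffle every pairing.
val pairs = rowsA * rowsB
println(pairs) // 21000000000 pairs - far larger than either input
```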
>>>>
>>>> Regards,
>>>> Padma Ch
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro
