spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Cartesian join on RDDs taking too much time
Date Wed, 25 May 2016 10:52:12 GMT
What is the use case for this? A Cartesian product is by definition slow in any system. Why do you need it? How long does your application take now?

> On 25 May 2016, at 12:42, Priya Ch <learnings.chitturi@gmail.com> wrote:
> 
> I tried dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even this is taking too much time.
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
>> Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as parquet, orc, ...?
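>> For instance, something like this (a minimal sketch; "joinedDF" and the output path are placeholders):
>> 
>>     // Parquet is compressed and columnar, so it is usually much smaller
>>     // and faster to write than plain text from saveAsTextFile.
>>     joinedDF.write.format("parquet").save("/hdfs_path/joined_parquet")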
>> 
>> // maropu
>> 
>>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitturi@gmail.com> wrote:
>>> Hi, yes, I have joined using DataFrame join. Now, to save this into HDFS, I am converting the joined dataframe to an RDD (dataframe.rdd) and using saveAsTextFile to save it. However, this is also taking too much time.
>>> 
>>> Thanks,
>>> Padma Ch
>>> 
>>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
>>>> Hi, 
>>>> 
>>>> It seems you'd be better off using DataFrame#join instead of RDD.cartesian, because RDD.cartesian always needs shuffle operations, which have a lot of overhead (reflection, serialization, ...).
>>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a broadcast strategy.
>>>> This is a little more efficient than RDD.cartesian.
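>>>> For example (a minimal sketch; "largeDF", "smallDF", and the join key "id" are placeholders):
>>>> 
>>>>     import org.apache.spark.sql.functions.broadcast
>>>> 
>>>>     // Hint Spark to broadcast the small (7 MB) table; with the default
>>>>     // spark.sql.autoBroadcastJoinThreshold of 10 MB it would typically
>>>>     // be broadcast automatically anyway.
>>>>     val joined = largeDF.join(broadcast(smallDF), largeDF("id") === smallDF("id"))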
>>>> 
>>>> // maropu
>>>> 
>>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>> It is basically a Cartesian join, like in an RDBMS.
>>>>> 
>>>>> Example:
>>>>> 
>>>>> SELECT * FROM FinancialCodes,  FinancialData
>>>>> 
>>>>> The result of this query matches every row in the FinancialCodes table with every row in the FinancialData table. Each row consists of all columns from the FinancialCodes table followed by all columns from the FinancialData table.
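>>>>> In the DataFrame API, the equivalent is a join with no join condition (a sketch; the DataFrame names are hypothetical):
>>>>> 
>>>>>     // With no condition this is a Cartesian product:
>>>>>     // m rows x n rows = m*n rows in the result.
>>>>>     val product = financialCodes.join(financialData)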
>>>>> 
>>>>> Not very useful 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>>  
>>>>> 
>>>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitturi@gmail.com> wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> I have two RDDs A and B, where A is 30 MB and B is 7 MB in size. A.cartesian(B) is taking too much time. Is there any bottleneck in the cartesian operation?
>>>>>> 
>>>>>> I am using Spark 1.6.0.
>>>>>> 
>>>>>> Regards,
>>>>>> Padma Ch
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> ---
>>>> Takeshi Yamamuro
>> 
>> 
>> 
>> -- 
>> ---
>> Takeshi Yamamuro
> 
