spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Deepak Gopalakrishnan <dgk...@gmail.com>
Subject Re: Mapper side join with DataFrames API
Date Wed, 02 Mar 2016 10:08:29 GMT
Thanks for the help guys.

Just to ask a part of my question in a little different way.

I have attached my screenshots here. There is so much of memory that is
unused and yet there is a spill ( as in screenshots). Any idea why ?

Thanks
Deepak

On Wed, Mar 2, 2016 at 5:14 AM, Michael Armbrust <michael@databricks.com>
wrote:

> Its helpful to always include the output of df.explain(true) when you are
> asking about performance.
>
> On Mon, Feb 29, 2016 at 6:14 PM, Deepak Gopalakrishnan <dgkris@gmail.com>
> wrote:
>
>> Hello All,
>>
>> I'm trying to join 2 dataframes A and B with a
>>
>> sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a=B.a");
>>
>> Now what I have done is that I have registeredTempTables for A and B
>> after loading these DataFrames from different sources. I need the join to
>> be really fast and I was wondering if there is a way to use the SQL
>> statement and then being able to do a mapper side join ( say my table B is
>> small) ?
>>
>> I read some articles on using broadcast to do mapper side joins. Could I
>> do something like this and then execute my sql statement to achieve mapper
>> side join ?
>>
>> DataFrame B = sparkContext.broadcast(B);
>> B.registerTempTable("B");
>>
>>
>> I have a join as stated above and I see in my executor logs the below :
>>
>> 16/02/29 17:02:35 INFO TaskSetManager: Finished task 198.0 in stage 7.0
>> (TID 1114) in 20354 ms on localhost (196/200)
>>
>> 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty
>> blocks out of 200 blocks
>>
>> 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote
>> fetches in 0 ms
>>
>> 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty
>> blocks out of 128 blocks
>>
>> 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote
>> fetches in 0 ms
>>
>> 16/02/29 17:03:03 INFO Executor: Finished task 199.0 in stage 7.0 (TID
>> 1115). 2511 bytes result sent to driver
>>
>> 16/02/29 17:03:03 INFO TaskSetManager: Finished task 199.0 in stage 7.0
>> (TID 1115) in 27621 ms on localhost (197/200)
>>
>> *16/02/29 17:07:06 INFO UnsafeExternalSorter: Thread 124 spilling sort
>> data of 256.0 KB to disk (0  time so far)*
>>
>>
>> Now, I have around 10G of executor memory and my memory faction should be
>> the default ( 0.75 as per the documentation) and my memory usage is < 1.5G(
>> obtained from the Storage tab on Spark dashboard), but still it says
>> spilling sort data. I'm a little surprised why this happens even when I
>> have enough memory free.
>>
>> Any inputs will be greatly appreciated!
>>
>> Thanks
>> --
>> Regards,
>> *Deepak Gopalakrishnan*
>> *Mobile*:+918891509774
>> *Skype* : deepakgk87
>> http://myexps.blogspot.com
>>
>>
>


-- 
Regards,
*Deepak Gopalakrishnan*
*Mobile*:+918891509774
*Skype* : deepakgk87
http://myexps.blogspot.com

Mime
View raw message