spark-user mailing list archives

From Raghavendra Pandey <raghavendra.pan...@gmail.com>
Subject Re: Left outer joining big data set with small lookups
Date Sat, 15 Aug 2015 02:10:17 GMT
In Spark 1.4 there is a parameter to control that, spark.sql.autoBroadcastJoinThreshold; its default value is 10
MB. So you need to cache your DataFrame to hint its size to the optimizer.
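[What the broadcast (replicated) join above does, conceptually: the small lookup is turned into an in-memory hash map and shipped whole to every task, so the big side is joined locally and never shuffled. A minimal plain-Python sketch of that idea, no Spark required; the table and column names are made up for illustration:]

```python
def broadcast_left_outer_join(big_rows, small_rows, key):
    # Build a hash map from the small lookup table -- this is the piece
    # Spark broadcasts to each executor when the table's estimated size is
    # under spark.sql.autoBroadcastJoinThreshold (10 MB by default in 1.4).
    lookup = {row[key]: row for row in small_rows}
    small_cols = sorted({c for row in small_rows for c in row if c != key})

    out = []
    for row in big_rows:            # single local pass over the big side
        merged = dict(row)
        match = lookup.get(row[key])  # hash probe, no shuffle
        for c in small_cols:
            # left outer semantics: unmatched big-side rows are kept,
            # with the lookup's columns set to None
            merged[c] = match.get(c) if match else None
        out.append(merged)
    return out

facts = [{"cust_id": 1, "amt": 10}, {"cust_id": 2, "amt": 5},
         {"cust_id": 9, "amt": 7}]
dims = [{"cust_id": 1, "name": "alice"}, {"cust_id": 2, "name": "bob"}]
print(broadcast_left_outer_join(facts, dims, "cust_id"))
```

[In Spark SQL itself you would not write this by hand; you would raise the threshold, or cache the small DataFrame so its statistics are known, and let the optimizer pick the broadcast join.]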
On Aug 14, 2015 7:09 PM, "VIJAYAKUMAR JAWAHARLAL" <sparkhelp@data2o.io>
wrote:

> Hi
>
> I am facing a huge performance problem when I try to left outer join a
> very big data set (~140 GB) with a bunch of small lookups [star schema type].
> I am using DataFrames in Spark SQL. It looks like the data is shuffled and
> skewed when that join happens. Is there any way to improve the performance of
> this type of join in Spark?
>
> How can I hint the optimizer to use a replicated join, etc., to avoid the
> shuffle? Would it help to create broadcast variables for the small lookups? If
> I create broadcast variables, how can I convert them into DataFrames and
> use them in a Spark SQL join?
>
> Thanks
> Vijay
