spark-user mailing list archives

From Nirav Patel <npa...@xactlycorp.com>
Subject spark 2.2.x - Broadcasthashjoin is not happening even after checkpointing
Date Thu, 08 Nov 2018 00:12:42 GMT
I am joining two datasets: one with a few hundred million records and another
with just 72 records. Without any intervention, Spark tries to do a
SortMergeJoin (shuffle exchange) and blows up with an OOM. I expect it to do a
map-side (broadcast) join instead.
I have auto broadcast on and I am not repartitioning my dataset.
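
For reference, "auto broadcast on" is read here as meaning that `spark.sql.autoBroadcastJoinThreshold` is not set to `-1`. That reading is an assumption, since the message does not name the setting. A minimal PySpark sketch for checking it (the helper name is hypothetical):

```python
def auto_broadcast_enabled(spark):
    """Return True unless automatic broadcast joins are disabled.

    `spark` is assumed to be an active pyspark.sql.SparkSession.
    Spark documents -1 as the value that disables automatic broadcasting.
    """
    # The setting comes back as a string, e.g. "10485760" for the 10 MB default.
    raw = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
    return raw.strip() != "-1"
```

Note that the threshold is compared against the planner's size estimate in bytes, not against a row count, so a 72-row dataset qualifies only if its estimated size is below the threshold.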

It works now if I save the small dataset and read it back. It doesn't work if I
checkpoint!
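
The two variants described above can be sketched as follows. This is a minimal PySpark sketch, not the original code: the message shows no code and names no file format, so Parquet, the paths, and all function names are assumptions.

```python
def small_via_checkpoint(spark, small, checkpoint_dir):
    """Variant reported to still produce a shuffle join.

    `spark` is assumed to be a SparkSession and `small` a DataFrame.
    """
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    # Eager by default: runs the lineage and returns a DataFrame
    # backed by the checkpointed data.
    return small.checkpoint()


def small_via_save_and_reload(spark, small, path):
    """Variant reported to produce a broadcast join (Parquet is assumed)."""
    small.write.mode("overwrite").parquet(path)
    # Reading back yields a file-based relation instead of an existing RDD.
    return spark.read.parquet(path)


def show_join_plan(large, small, key):
    """Print the physical plan chosen for joining `large` with `small`."""
    large.join(small, on=key).explain()
```

Passing the result of each variant to `show_join_plan` should make the difference visible as SortMergeJoin versus BroadcastHashJoin in the printed plan, if the behaviour matches the screenshots below.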

Attaching two screenshots. The first one is where I am checkpointing the small
dataset.

[image: Screen Shot 2018-11-07 at 4.04.04 PM.png]

The plan above reads an ExistingRDD from the checkpoint. It has only 72 records
and Spark still decided to do a shuffle join!

Here is the plan when I save it:

[image: Screen Shot 2018-11-07 at 4.03.53 PM.png]

Now it does a broadcast join.

So my workaround is to save the small dataset and read it back.
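
Another option sometimes used when the planner does not choose a broadcast join on its own is an explicit broadcast hint on the small side. The message does not say this was tried, so whether it avoids the OOM in this job is untested. A minimal PySpark sketch, assuming `DataFrame.hint` is available (it was added in Spark 2.2.0) and using a hypothetical function name:

```python
def join_with_broadcast_hint(large, small, key):
    """Join `large` to `small` on `key`, hinting that `small` be broadcast.

    Both inputs are assumed to be pyspark.sql.DataFrame objects.
    """
    # The hint asks the planner to broadcast `small` regardless of its
    # estimated size, so it bypasses the automatic threshold check.
    return large.join(small.hint("broadcast"), on=key)
```

Calling `.explain()` on the result shows whether the hint was honoured.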

Why didn't checkpointing work?

Why doesn't it work without checkpointing or saving? (I don't have the
lineage here as it's too big and complicated.) Checkpointing does help to
truncate the previous lineage by executing it, but what happened after that
was not expected.
