spark-user mailing list archives

From "Lalwani, Jayesh" <>
Subject Re: Why doesn't spark use broadcast join?
Date Thu, 29 Mar 2018 14:54:54 GMT
Try adding a broadcast hint, as shown here
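The page linked above is not preserved in this archive, so here is a rough sketch of what a broadcast hint can look like in PySpark. It covers both the SQL comment-style hint and the DataFrame `broadcast()` function. The table names T1 and T2 come from the question. The join column `id` and the helper names `hinted_query`, `show_plan`, and `broadcast_join` are hypothetical, because the FROM/JOIN part of the original query is not shown.

```python
def hinted_query(small_table: str = "T2") -> str:
    """Build a SQL string whose SELECT carries a broadcast hint for small_table."""
    # ASSUMPTION: the join column `id` and the join shape are made up for
    # illustration; the thread does not show the real FROM/JOIN clause.
    return (
        f"SELECT /*+ BROADCAST({small_table}) */ T1.f1, T1.f2 "
        f"FROM T1 JOIN {small_table} ON T1.id = {small_table}.id "
        "WHERE T1.f1 = 'bla' AND T1.f2 = 'bla2'"
    )


def show_plan(spark) -> None:
    """Print the physical plan for the hinted query.

    Expects an active SparkSession in which T1 and T2 are already
    registered as tables or temp views.
    """
    spark.sql(hinted_query()).explain()


def broadcast_join(t1_df, t2_df):
    """DataFrame API equivalent: mark the small side with broadcast()."""
    # Imported here so the sketch can be read and loaded without Spark installed.
    from pyspark.sql.functions import broadcast

    # ASSUMPTION: both DataFrames share a hypothetical join column `id`.
    return t1_df.join(broadcast(t2_df), on="id")
```

If the hint is honored, the output of `explain()` should show a BroadcastHashJoin where the SortMergeJoin appeared before. The name inside `BROADCAST(...)` has to match the table name or alias used in the query.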

From: Vitaliy Pisarev <>
Date: Thursday, March 29, 2018 at 8:42 AM
To: "" <>
Subject: Why doesn't spark use broadcast join?

I am looking at the physical plan for the following query:

SELECT f1,f2,f3,...
WHERE  f1 = 'bla'
       AND f2 = 'bla2'
       AND some_date >= date_sub(current_date(), 1)
An important detail: table T1 can be very large (hundreds of thousands of rows), but
table T2 is rather small, in the thousands of rows at most.
In this particular case, T2 has 2 rows.

In the physical plan, I see that a SortMergeJoin is performed, even though this looks like a
perfect candidate for a broadcast join.

What could be the reason for this?
Is there a way to hint to the optimizer, in the SQL syntax, that it should perform a broadcast join?

I am writing this in PySpark, and the query itself runs over Parquet files stored in Azure Blob Storage.

