spark-user mailing list archives

From: jelmer
Subject: Map side join without broadcast
Date: Sat, 29 Jun 2019 11:10:01 GMT

I have two dataframes:

Dataframe A, which contains one element per partition, each of them
gigabytes in size (an index).

Dataframe B, which is made up of millions of small rows.

I want to join B on A, but I want all the work to be done on the executors
holding the partitions of dataframe A.

Is there a way to accomplish this without putting dataframe B in a
broadcast variable or doing a broadcast join?
