spark-user mailing list archives

From: Arbab Khalil
Subject: Re: Map side join without broadcast
Date: Sat, 29 Jun 2019 22:35:45 GMT
You can use coalesce(1) or repartition on B, but it would be better to
cache A so that it is available in memory on all executors, because it
contains only one row.
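
A rough sketch of what that could look like in Scala follows. The data, the
column names (key, index, value) and the object name are made up for
illustration and are not from the original question. Disabling
spark.sql.autoBroadcastJoinThreshold is my own assumption, added so that
Spark does not pick a broadcast join by itself on small inputs. Note that
this does not force the join to run on the executors holding A; the planner
still chooses the strategy, so check the plan with explain().

```scala
import org.apache.spark.sql.SparkSession

object CacheAndRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-and-repartition-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumption: turn off automatic broadcast joins, since the question
    // asks for a join without broadcasting.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Hypothetical stand-ins: A is the large index, B the many small rows.
    val a = Seq((1, "index-1"), (2, "index-2")).toDF("key", "index")
    val b = Seq((1, "row-a"), (1, "row-b"), (2, "row-c")).toDF("key", "value")

    // cache() is lazy, so run an action to actually materialise A in memory.
    val cachedA = a.cache()
    cachedA.count()

    // Repartition B by the join key (b.coalesce(1) is the other option
    // mentioned above, which collapses B into a single partition).
    val repartitionedB = b.repartition($"key")

    val joined = repartitionedB.join(cachedA, Seq("key"))
    joined.explain() // inspect which join strategy the planner chose
    joined.show()

    spark.stop()
  }
}
```

If the plan still shows a shuffle of A, this approach alone will not keep the
work on the executors that hold A's partitions.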

On Sat, Jun 29, 2019 at 4:10 PM jelmer wrote:

> I have 2 dataframes,
> Dataframe A which contains 1 element per partition that is gigabytes big
> (an index)
> Dataframe B which is made up of millions of small rows.
> I want to join B on A, but I want all the work to be done on the executors
> holding the partitions of dataframe A
> Is there a way to accomplish this without putting dataframe B in a
> broadcast variable or doing a broadcast join ?

Arbab Khalil
Software Design Engineer
