spark-user mailing list archives

From Arbab Khalil <akha...@an10.io>
Subject Re: Map side join without broadcast
Date Sat, 29 Jun 2019 22:35:45 GMT
You can use coalesce(1) or repartition on B, but it would be better to cache
A so that it is available on all executors, and in memory, because it
contains only one row.
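
A minimal PySpark sketch of that suggestion, in case it helps. The function
name, the `key` join column and the `b_partitions` parameter are hypothetical
(the thread does not name a join column), and this is only one reading of the
advice, not a guarantee that the join will avoid a shuffle:

```python
def join_b_onto_cached_a(df_a, df_b, key, b_partitions=1):
    """Cache A, reduce B's partition count, then join B onto A.

    df_a and df_b are expected to be pyspark.sql.DataFrame objects;
    key is an assumed join column present in both.
    """
    # cache() is lazy: A is stored on the executors the first time
    # an action computes it.
    a_cached = df_a.cache()

    # coalesce(n) merges B's partitions without a full shuffle.
    # df_b.repartition(n, key) is the alternative mentioned above;
    # it does a full shuffle and hash-partitions B by the key.
    b_reduced = df_b.coalesce(b_partitions)

    return b_reduced.join(a_cached, on=key, how="inner")
```

Calling `explain()` on the returned DataFrame shows which join strategy Spark
actually picked, which is worth checking before relying on this.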

On Sat, Jun 29, 2019 at 4:10 PM jelmer <jkuperus@gmail.com> wrote:

> I have 2 dataframes,
>
> Dataframe A which contains 1 element per partition that is gigabytes big
> (an index)
>
> Dataframe B which is made up of millions of small rows.
>
> I want to join B on A but I want all the work to be done on the executors
> holding the partitions of dataframe A
>
> Is there a way to accomplish this without putting dataframe B in a
> broadcast variable or doing a broadcast join ?
>
>

-- 
Regards,
Arbab Khalil
Software Design Engineer
