spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: Caching broadcasted DataFrames?
Date Thu, 25 Aug 2016 19:38:02 GMT
Hi,

You need to cache d1 to prevent re-computation (including disk reads),
because Spark re-broadcasts the data on every SQL execution.
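
For example, a minimal sketch (assuming the Spark 2.0 shell, where `spark`
is the predefined SparkSession; the parquet paths and the join key "id" are
assumptions, not from your code):

  import org.apache.spark.sql.functions.broadcast

  val d1 = spark.read.parquet("/path/to/d1")   // small table
  val d2 = spark.read.parquet("/path/to/d2")   // large table #1
  val d3 = spark.read.parquet("/path/to/d3")   // large table #2

  // cache d1 so the second broadcast reads it from memory, not disk
  d1.cache()

  // each join still triggers its own broadcast, but both builds
  // start from the cached copy of d1
  val out1 = d2.join(broadcast(d1), Seq("id"))
  val out2 = d3.join(broadcast(d1), Seq("id"))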

// maropu

On Fri, Aug 26, 2016 at 2:07 AM, Jestin Ma <jestinwith.an.e@gmail.com>
wrote:

> I have a DataFrame d1 that I would like to join with two separate
> DataFrames.
> Since d1 is small enough, I broadcast it.
>
> What I understand about cache vs. broadcast is that cache leads to each
> executor storing the partitions it is assigned in memory (cluster-wide
> in-memory), while broadcast leads to each node (with multiple executors)
> storing a copy of the whole dataset (all partitions) in its own memory.
>
> Since the dataset for d1 is used in two separate joins, should I also
> persist it to prevent reading it from disk again? Or would broadcasting the
> data already take care of that?
>
>
> Thank you,
> Jestin
>



-- 
---
Takeshi Yamamuro
