You need to cache d1 to prevent re-computation (including disk reads), because Spark re-broadcasts the data on every SQL execution.
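A minimal sketch of that advice, assuming a running SparkSession `spark`, that d1 fits under the broadcast threshold, and illustrative DataFrames d2 and d3 standing in for the two join targets:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-cache-sketch").getOrCreate()
import spark.implicits._

// Illustrative data; in practice d1, d2, and d3 come from your own sources.
val d1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val d2 = Seq((1, "x"), (3, "y")).toDF("id", "v2")
val d3 = Seq((2, "p"), (4, "q")).toDF("id", "v3")

// Cache d1 so the second join does not recompute it (and re-read it from disk).
// broadcast() only hints the join strategy for one query; it does not persist
// the data across separate SQL executions.
d1.cache()

val joined1 = d2.join(broadcast(d1), "id")
val joined2 = d3.join(broadcast(d1), "id")
```

With `d1.cache()`, the second join reuses the in-memory copy of d1 instead of rebuilding it from its source before re-broadcasting.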

// maropu

On Fri, Aug 26, 2016 at 2:07 AM, Jestin Ma <jestinwith.an.e@gmail.com> wrote:
I have a DataFrame d1 that I would like to join with two separate DataFrames.
Since d1 is small enough, I broadcast it.

What I understand about cache vs. broadcast is that caching leads to each executor storing, in memory, the partitions it is assigned (cluster-wide in-memory storage), while broadcasting leads to each node (possibly running multiple executors) storing a full copy of the dataset (all partitions) in its own memory.

Since d1 is used in two separate joins, should I also persist it to prevent it from being read from disk again? Or does broadcasting the data already take care of that?

Thank you,

Takeshi Yamamuro