spark-user mailing list archives

From Jestin Ma <jestinwith.a...@gmail.com>
Subject Caching broadcasted DataFrames?
Date Thu, 25 Aug 2016 17:07:25 GMT
I have a DataFrame d1 that I would like to join with two separate
DataFrames.
Since d1 is small enough, I broadcast it.

My understanding of cache vs. broadcast is that cache leads to each
executor storing the partitions it is assigned in memory (cluster-wide
in-memory). Broadcast leads to each node (with multiple executors) storing
a copy of the dataset (all partitions) in its own memory.
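
To make that mental model concrete, here is a toy sketch in plain Python. It is not Spark code and does not use any Spark API. The function names, the worker and partition labels, and the round-robin assignment are all assumptions made for illustration. "Worker" is left abstract on purpose, since the sketch does not settle whether the unit holding a copy is a node or an executor.

```python
def cache_placement(partitions, workers):
    """Toy model of cache: each worker keeps only its assigned partitions.

    Round-robin assignment is an assumption for illustration only.
    """
    placement = {w: [] for w in workers}
    for i, part in enumerate(partitions):
        placement[workers[i % len(workers)]].append(part)
    return placement


def broadcast_placement(partitions, workers):
    """Toy model of broadcast: every worker keeps a full copy."""
    return {w: list(partitions) for w in workers}


# Hypothetical labels standing in for d1's partitions and the cluster's workers.
partitions = ["p0", "p1", "p2", "p3"]
workers = ["w0", "w1"]

cached = cache_placement(partitions, workers)
broadcasted = broadcast_placement(partitions, workers)

# Total number of partition copies held across all workers in each model.
cached_total = sum(len(parts) for parts in cached.values())
broadcast_total = sum(len(parts) for parts in broadcasted.values())

print(cached)
print(broadcasted)
print(cached_total, broadcast_total)
```

In this model the cached dataset exists once across the cluster, split by partition, while the broadcast dataset exists once per worker in full.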

Since the dataset for d1 is used in two separate joins, should I also
persist it to prevent reading it from disk again? Or would broadcasting the
data already take care of that?


Thank you,
Jestin
