spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacek Laskowski <ja...@japila.pl>
Subject Re: understanding spark shuffle file re-use better
Date Sun, 17 Jan 2021 14:06:40 GMT
Hi,

An interesting question that I must admit I'm not sure how to answer myself
actually :)

Off the top of my head, I'd **guess** unless you cache the first query
these two queries would share nothing. With caching, there's a phase in
query execution when a canonicalized version of a query is used to look up
any cached queries.

Again, I'm not really sure and if I'd have to answer it (e.g. as part of an
interview) I'd say nothing would be shared / re-used.

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>


On Wed, Jan 13, 2021 at 5:39 PM Koert Kuipers <koert@tresata.com> wrote:

> is shuffle file re-use based on identity or equality of the dataframe?
>
> for example if run the exact same code twice to load data and do
> transforms (joins, aggregations, etc.) but without re-using any actual
> dataframes, will i still see skipped stages thanks to shuffle file re-use?
>
> thanks!
> koert
>

Mime
View raw message