spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vamshi Talla <vamsh...@hotmail.com>
Subject Re: repartition
Date Mon, 09 Jul 2018 02:32:51 GMT
Hi Ravi,

RDDs are always immutable, so you cannot change them, instead you create new ones by transforming
one. Repartition is a transformation, so it lazily evaluated, hence computed only when you
call an action on it.

Thanks.
Vamshi Talla

On Jul 8, 2018, at 12:26 PM, <ryandam.9@gmail.com<mailto:ryandam.9@gmail.com>>
<ryandam.9@gmail.com<mailto:ryandam.9@gmail.com>> wrote:

Hi,

Can anyone clarify how repartition works please ?


  *   I have a DataFrame df which has only one partition:


// Returns 1
df.rdd.getNumPartitions

•         I repartitioned it by passing “3” and assigned it a new DataFrame newdf

val newdf = df.repartition(3)


•         newdf shows 3 as number of partitions
// Returns 3
newdf.rdd.getNumPartitions

•         df still shows 1

// Return 1
df.rdd.getNumPartitions

My Question is that,


  1.  How does repartition work ? Does it copy original dataframe and create X partitions
as specified by  repartition ?  If that is the case, aren’t there two copies of same data
in memory as shown in below diagram ?

Or my understanding is incorrect ?

As per executions above,  looks like there are two copies as after repartition, df still has
1 partition !!


  1.  Repartition is executed immediately or it waits for some trigger [kind of action] ?



<image001.png>

Thanks,
Ravi

Mime
View raw message