I am trying to find the code that cleans up uncached RDDs.

 

Thanks,

Nasrulla

 

From: Charoes <charoes@gmail.com>
Sent: Tuesday, May 21, 2019 5:10 PM
To: Nasrulla Khan Haris <Nasrulla.Khan@microsoft.com.invalid>
Cc: Wenchen Fan <cloud0fan@gmail.com>; dev@spark.apache.org
Subject: Re: RDD object Out of scope.

 

If you cached an RDD and hold a reference to that RDD in your code, then your RDD will NOT be cleaned up.

There is a ReferenceQueue in ContextCleaner, which is used to keep track of references to RDDs, Broadcasts, Accumulators, etc.
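The mechanism described above can be sketched in plain JVM terms (Java here, though Spark's ContextCleaner itself is Scala): a WeakReference registered with a ReferenceQueue surfaces on that queue once GC collects the referent, which is how a cleaner learns that user code no longer holds the object. The class and method names below (TrackedRef, awaitCleanup) are illustrative, not Spark's actual API.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class CleanerSketch {
    // Weak reference carrying the id of the tracked object (a stand-in for an RDD id).
    static class TrackedRef extends WeakReference<Object> {
        final int id;
        TrackedRef(Object referent, int id, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.id = id;
        }
    }

    // Wait for GC to enqueue the weak reference; returns the tracked id,
    // or -1 if the referent was never collected within ~5 s.
    static int awaitCleanup(ReferenceQueue<Object> queue) {
        try {
            for (int i = 0; i < 50; i++) {
                System.gc();                                       // hint the collector
                TrackedRef dead = (TrackedRef) queue.remove(100);  // wait up to 100 ms
                if (dead != null) return dead.id;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return -1;
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object rdd = new Object();                 // pretend this is an RDD
        TrackedRef ref = new TrackedRef(rdd, 42, queue);

        rdd = null;                                // user code drops the last strong reference
        int collected = awaitCleanup(queue);       // the "cleaner" learns the object is gone
        System.out.println("clean up id " + collected);  // a real cleaner would unpersist here
    }
}
```

The key point for this thread: as long as your code keeps a strong reference to the RDD, the weak reference never enqueues and the cleaner does nothing.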

 

On Wed, May 22, 2019 at 1:07 AM Nasrulla Khan Haris <Nasrulla.Khan@microsoft.com.invalid> wrote:

Thanks for the reply, Wenchen. I am curious about what happens when an RDD goes out of scope and it is not cached.

 

Nasrulla

 

From: Wenchen Fan <cloud0fan@gmail.com>
Sent: Tuesday, May 21, 2019 6:28 AM
To: Nasrulla Khan Haris <Nasrulla.Khan@microsoft.com.invalid>
Cc: dev@spark.apache.org
Subject: Re: RDD object Out of scope.

 

An RDD is essentially a pointer to the actual data. Unless it's cached, we don't need to clean up the RDD.
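This point can be illustrated with a small analogy (Java, with made-up names, not Spark's API): an uncached dataset object holds only a recipe for computing its data, so dropping the reference leaves nothing to release beyond a tiny object that ordinary GC reclaims; only caching materializes data that must later be cleaned up explicitly.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Analogy only: like an RDD, this object describes how to compute data
// (its "lineage") rather than holding the data itself.
public class LazyDataset {
    private final Supplier<List<Integer>> lineage; // how to compute the data
    private List<Integer> cached;                  // non-null only after cache()

    LazyDataset(Supplier<List<Integer>> lineage) { this.lineage = lineage; }

    List<Integer> collect() {                      // compute on demand
        return cached != null ? cached : lineage.get();
    }

    void cache()     { cached = lineage.get(); }   // materialize: now holds real data
    void unpersist() { cached = null; }            // explicit cleanup is needed

    public static void main(String[] args) {
        LazyDataset ds = new LazyDataset(() ->
            IntStream.range(0, 5).map(x -> x * x).boxed().collect(Collectors.toList()));
        System.out.println(ds.collect());  // prints [0, 1, 4, 9, 16]; computed fresh, nothing to clean up
        ds.cache();
        System.out.println(ds.collect());  // served from cache
        ds.unpersist();                    // releases the materialized data
    }
}
```

Dropping an uncached LazyDataset reference is free; only the cached state ties up storage that survives until unpersist().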

 

On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris <Nasrulla.Khan@microsoft.com.invalid> wrote:

Hi Spark developers,

 

Can someone point out the code where RDD objects go out of scope? I found the ContextCleaner code, in which only persisted RDDs are cleaned up at regular intervals, provided the RDD is registered for cleanup. I have not found where the destructor for an RDD object is invoked. I am trying to understand when RDD cleanup happens when the RDD is not persisted.

 

Thanks in advance, appreciate your help.

Nasrulla