spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: Identifying specific persisted DataFrames via getPersistentRDDs()
Date Fri, 04 May 2018 02:26:32 GMT
Why do you need the underlying RDDs? Can't you just unpersist the
dataframes that you don't need?


On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas <nicholas.chammas@gmail.com>
wrote:

> This seems to be an underexposed part of the API. My use case is this: I
> want to unpersist all DataFrames except a specific few. I want to do this
> because I know at a specific point in my pipeline that I have a handful of
> DataFrames that I need, and everything else is no longer needed.
>
> The problem is that there doesn’t appear to be a way to identify specific
> DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
> which is the only way I’m aware of to ask Spark for all currently persisted
> RDDs:
>
> >>> a = spark.range(10).persist()>>> a.rdd.id()8>>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
> [(3, JavaObject id=o36)]
>
> As you can see, the id of the persisted RDD, 8, doesn’t match the id
> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
> returned by getPersistentRDDs() and know which ones I want to keep.
>
> id() itself appears to be an undocumented method of the RDD API, and in
> PySpark getPersistentRDDs() is buried behind the Java sub-objects
> <https://issues.apache.org/jira/browse/SPARK-2141>, so I know I’m
> reaching here. But is there a way to do what I want in PySpark without
> manually tracking everything I’ve persisted myself?
>
> And more broadly speaking, do we want to add additional APIs, or formalize
> currently undocumented APIs like id(), to make this use case possible?
>
> Nick
> ​
>

Mime
View raw message