spark-dev mailing list archives

From Liang-Chi Hsieh <>
Subject Re: How to cache SparkPlan.execute for reusing?
Date Fri, 03 Mar 2017 02:58:26 GMT

Internally, in each partition of the resulting RDD[InternalRow], the iterator
hands you the same UnsafeRow object for every record; the row is mutated in
place as you iterate. A plain RDD.cache therefore doesn't work: the cached
partition ends up holding many references to that one row, so reading it back
yields the same row repeated. I'm not sure why you see empty output, though.
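The effect of that row reuse can be sketched in plain Scala, without Spark.
`MutableRow` below is a hypothetical stand-in for UnsafeRow, and `rowIterator`
mimics a partition iterator that writes each record into one shared buffer;
the names are illustrative, not Spark APIs:

```scala
// Hypothetical stand-in for UnsafeRow: a mutable row buffer.
final class MutableRow(var value: Int) {
  def copyRow(): MutableRow = new MutableRow(value)
  override def toString: String = s"Row($value)"
}

object RowReuseDemo {
  // Like an InternalRow iterator: reuses one buffer for every record.
  def rowIterator(values: Seq[Int]): Iterator[MutableRow] = {
    val buffer = new MutableRow(0)
    values.iterator.map { v => buffer.value = v; buffer }
  }

  def main(args: Array[String]): Unit = {
    // Naive "cache": materialize the iterator into a collection.
    val cachedWrong = rowIterator(Seq(1, 2, 3)).toList
    // Every element is the same object, holding the last value written.
    println(cachedWrong.map(_.value)) // List(3, 3, 3)

    // Fix: copy each row before storing, which is what caching
    // RDD[InternalRow] would require (at the cost of extra allocations).
    val cachedRight = rowIterator(Seq(1, 2, 3)).map(_.copyRow()).toList
    println(cachedRight.map(_.value)) // List(1, 2, 3)
  }
}
```

The same shape applies to the real RDD: without copying, the cached partition
stores one buffer many times; copying each row first restores distinct rows.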

Dataset.cache() is the intended way to cache SQL query results. Even if you do
cache the RDD[InternalRow] via RDD.cache with the trick of copying each row
first (at a significant performance cost), a new query (plan) will not
automatically reuse the cached RDD, because planning creates new RDDs.

summerDG wrote
> We are optimizing Spark SQL for adaptive execution, so a SparkPlan may be
> reused across strategy choices. But we find that once the result of
> SparkPlan.execute, an RDD[InternalRow], is cached using RDD.cache, the query
> output is empty.
> 1. How to cache the result of SparkPlan.execute?
> 2. Why is RDD.cache invalid for RDD[InternalRow]?

Liang-Chi Hsieh | @viirya 
Spark Technology Center 