spark-user mailing list archives

From Yi Huang <huang.yi.3...@gmail.com>
Subject Spark Union Breaks Caching Behaviour
Date Tue, 07 Apr 2020 16:44:35 GMT
Dear Community,

I am new to Spark, and I am confused by the comment in the following
method.

def union(other: Dataset[T]): Dataset[T] = withSetOperator {
  // This breaks caching, but it's usually ok because it addresses a very specific use case:
  // using union to union many files or partitions.
  CombineUnions(Union(logicalPlan, other.logicalPlan)).mapChildren(AnalysisBarrier)
}

Here is the corresponding PR comment:
https://github.com/apache/spark/pull/10577#discussion_r48820132


Another option would just be to do this at construction time, that way we
can avoid paying the cost in the analyzer. *This would still limit the
cases we could cache (i.e. we'd miss cached data unioned with other data),
but that doesn't seem like a huge deal.*
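
One possible reading of this remark, sketched as a toy model in Scala. This is
an assumption about what "breaks caching" refers to, not something confirmed by
the PR: if nested unions are flattened eagerly, a cached plan such as
union(a, b) no longer appears as a subtree of union(a, b, c), so a cache lookup
that matches on plan subtrees cannot find it. All names below (Plan, Scan,
CachedRelation, useCachedData, UnionCachingSketch) are hypothetical stand-ins
and not Spark's real classes.

```scala
// Toy model only: hypothetical stand-ins, not Spark's LogicalPlan or CacheManager.
sealed trait Plan
case class Scan(name: String) extends Plan
case class Union(children: List[Plan]) extends Plan
case class CachedRelation(original: Plan) extends Plan

object UnionCachingSketch {
  // Flatten nested unions into a single n-ary union.
  def combineUnions(plan: Plan): Plan = plan match {
    case Union(children) =>
      Union(children.map(combineUnions).flatMap {
        case Union(grandChildren) => grandChildren
        case other                => List(other)
      })
    case other => other
  }

  // Replace every subtree that equals a cached plan with a cached relation.
  def useCachedData(plan: Plan, cached: Set[Plan]): Plan =
    if (cached.contains(plan)) CachedRelation(plan)
    else plan match {
      case Union(children) => Union(children.map(useCachedData(_, cached)))
      case other           => other
    }

  def main(args: Array[String]): Unit = {
    val ab = Union(List(Scan("a"), Scan("b")))
    val cached: Set[Plan] = Set(ab)          // pretend union(a, b) was cached
    val nested = Union(List(ab, Scan("c")))  // union(union(a, b), c)
    val flat = combineUnions(nested)         // union(a, b, c)

    // The nested plan still contains union(a, b), so the lookup finds it.
    println(useCachedData(nested, cached))
    // The flattened plan has no union(a, b) subtree, so nothing is replaced.
    println(useCachedData(flat, cached))
  }
}
```

In this sketch the first result contains a CachedRelation node and the second
does not, which would correspond to "we'd miss cached data unioned with other
data". Whether Spark's real cache lookup behaves the same way is exactly what
would need confirming.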


Could anyone kindly explain what *This breaks caching* means?
A concrete example would be much appreciated.

Best regards,
Yi Huang
