spark-user mailing list archives

From Yi Huang
Subject Spark Union Breaks Caching Behaviour
Date Tue, 07 Apr 2020 16:44:35 GMT
Dear Community,

I am new to Spark, and I am confused by a comment in the following
method.

def union(other: Dataset[T]): Dataset[T] = withSetOperator {
  // This breaks caching, but it's usually ok because it addresses a very
  // specific use case:
  // using union to union many files or partitions.

Here is the corresponding PR comment:

Another option would just be to do this at construction time, that way we
can avoid paying the cost in the analyzer. *This would still limit the
cases we could cache (i.e. we'd miss cached data unioned with other data),
but that doesn't seem like a huge deal.*

Could anyone please explain what *This breaks caching* means?
An example would be much appreciated.

Best regards,
Yi Huang
