DataFrames are a narrower, more specific abstraction, meant for tabular data. Where your data is tabular, it makes more sense to use them, especially because knowing the data's structure lets the framework optimize a lot more under the hood for you, whereas it can do nothing with an RDD of arbitrary objects. DataFrames are not somehow a "better RDD".

Datasets are more like the new RDDs, supporting more general objects and programmatic access. They are still a different thing, for a different purpose, from DataFrames, but they have an API more similar to DataFrames and get some of the same kinds of benefits for simple types via Encoders.
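
To make the distinction a bit more concrete, here is a minimal Scala sketch, assuming Spark 2.x on the classpath and a local master. The `Person` case class, the `DatasetVsDataFrame` object and the `adultCounts` helper are hypothetical names used only for illustration. It runs the same filter once as a column expression on a DataFrame and once as a typed lambda on a Dataset, and prints both physical plans so they can be compared.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Int)

object DatasetVsDataFrame {

  // Returns (count via DataFrame column expression, count via typed Dataset lambda).
  def adultCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._ // brings in Encoders, toDS() and the $"col" syntax

    // A typed Dataset; the Encoder for Person is derived from the case class.
    val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    // The same data viewed as a DataFrame, i.e. Dataset[Row].
    val df: DataFrame = people.toDF()

    // Column expression: the predicate is visible to the optimizer.
    val adultsDf = df.filter($"age" >= 21)

    // Typed lambda: checked at compile time, but an arbitrary JVM function
    // from the optimizer's point of view.
    val adultsDs = people.filter(_.age >= 21)

    // Print the physical plans to see how each variant is executed.
    adultsDf.explain()
    adultsDs.explain()

    (adultsDf.count(), adultsDs.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-vs-dataframe")
      .master("local[*]") // assumption: run locally for the demo
      .getOrCreate()
    try println(adultCounts(spark))
    finally spark.stop()
  }
}
```

Both variants return the same rows. What differs is how much of the computation the framework can see, which is why the choice is more about the shape of your data and code than about one API being strictly faster.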

On Tue, Nov 22, 2016 at 2:50 PM jggg777 wrote:
I've seen a number of visuals showing the processing time benefits of using
Datasets+DataFrames over RDDs, but I'd assume that there are performance
benefits to using a defined case class instead of a generic Dataset[Row].  The
tale of three Spark APIs post mentions "If you want higher degree of
type-safety at compile time, want typed JVM objects, *take advantage of
Catalyst optimization, and benefit from Tungsten’s efficient code
generation, use Dataset.*"

Are there any comparisons showing the performance differences between
Datasets and DataFrames?  Or more information about how Catalyst/Tungsten
handle them differently?
