spark-user mailing list archives

From Sean Owen
Subject Re: Is there a processing speed difference between DataFrames and Datasets?
Date Tue, 22 Nov 2016 14:53:46 GMT
DataFrames are a narrower, more specific type of abstraction, for tabular
data. Where your data is tabular, they make more sense to use, especially
because this knowledge means a lot more can be optimized under the hood for
you, whereas the framework can do nothing with an RDD of arbitrary objects.
DataFrames are not somehow a "better RDD".

Datasets are more like the new RDDs, supporting more general objects and
programmatic access. They are still a different thing, for a different
purpose, from DataFrames, but they have an API more similar to DataFrames
and some of the same types of benefits for simple types via Encoders.
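
To make the distinction concrete, here is a minimal sketch in Scala, assuming
Spark 2.x run in local mode. The Person case class, the sample rows, and the
adultNames helper are hypothetical and exist only for illustration. It applies
the same filter once as an untyped column expression on a DataFrame and once
as a typed lambda on a Dataset[Person]. The column expression is visible to
the optimizer, while the lambda is an opaque function applied to deserialized
objects, so the two printed plans will typically differ. It is not a benchmark
and makes no claim about how large the difference is.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Long)

object DataFrameVsDataset {

  // Runs the same filter through both APIs and returns the matching names.
  def adultNames(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("Ann", 34L), Person("Bob", 19L)).toDS()
    val df: DataFrame = ds.toDF()

    // Untyped column expression: the predicate is part of the query plan.
    val adultsDf = df.filter($"age" >= 21)

    // Typed lambda: an arbitrary function over Person objects.
    val adultsDs = ds.filter((p: Person) => p.age >= 21)

    // Print the physical plans so they can be compared by eye.
    adultsDf.explain()
    adultsDs.explain()

    val fromDf = adultsDf.select("name").as[String].collect().toSeq
    val fromDs = adultsDs.map(_.name).collect().toSeq
    (fromDf, fromDs)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-vs-ds-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(adultNames(spark))
    finally spark.stop()
  }
}
```

Both variants return the same rows. What differs is how much of the
computation the framework can see, which is the point made above about
tabular knowledge enabling optimization.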

On Tue, Nov 22, 2016 at 2:50 PM jggg777 wrote:

> I've seen a number of visuals showing the processing time benefits of using
> Datasets+DataFrames over RDDs, but I'd assume that there are performance
> benefits to using a defined case class instead of a generic Dataset[Row].  The
> tale of three Spark APIs post mentions "If you want higher degree of
> type-safety at compile time, want typed JVM objects, *take advantage of
> Catalyst optimization, and benefit from Tungsten’s efficient code
> generation, use Dataset.*"
> Are there any comparisons showing the performance differences between
> Datasets and DataFrames?  Or more information about how Catalyst/Tungsten
> handle them differently?