spark-user mailing list archives

From Sean Owen <>
Subject Re: DataFrame use case
Date Tue, 16 Aug 2016 18:05:06 GMT
I'd say that Datasets, not DataFrames, are the natural evolution of
RDDs. DataFrames are for inherently tabular data, and are most naturally
manipulated by SQL-like operations. Datasets operate on programming-language
objects, as RDDs do.

So, RDDs to DataFrames isn't quite apples-to-apples to begin with.
It's just never true that "X is always faster than Y" in a case like
this. Indeed your case doesn't sound like anything where a tabular
representation would be beneficial. There's overhead to treating it
like that. You're doing almost nothing to the data itself except
counting it, and RDDs have the lowest overhead of the three concepts
because they treat their contents as opaque objects anyway.

The benefit comes when you do things like SQL-like operations on
tabular data in the DataFrame API instead of RDD API. That's where
more optimization can kick in. Dataset brings some of the same
possible optimizations to an RDD-like API because it has more
knowledge of the type and nature of the entire data set.

If you're really only manipulating byte arrays, I don't know if
DataFrame adds anything. I know Dataset has some specialization for
byte[], so I'd expect you could see some storage benefits over RDDs.

On Tue, Aug 16, 2016 at 6:32 PM, jtgenesis <> wrote:
> Hey guys, I've been digging around trying to figure out if I should
> transition from RDDs to DataFrames. I'm currently using RDDs to represent
> tiles of binary imagery data and I'm wondering if representing the data as a
> DataFrame is a better solution.
> To get my feet wet, I did a little comparison of a word-count application
> on a 1GB file of random text, using an RDD and a DataFrame. I got the
> following results:
> RDD Count total: 137733312 Time Elapsed: 44.5675378 s
> DataFrame Count total: 137733312 Time Elapsed: 69.201253448 s
> I figured the DataFrame would outperform the RDD, since I've seen many
> sources claim superior speeds with DataFrames. These results could be due
> to an implementation issue, the unstructured data, or the data source; I'm
> not really sure.
> This leads me to take a step back and figure out which applications are
> better suited to DataFrames than to RDDs. In my case, while the original
> image file is unstructured, the data is loaded into a pairRDD, where the key
> contains multiple attributes that pertain to the value. The value is a chunk
> of the image represented as an array of bytes. Since my data will be in a
> structured format, I don't see why I can't benefit from DataFrames. However,
> should I be concerned about any performance issues that pertain to
> processing/moving byte arrays (each chunk is of uniform size, in the KB-MB
> range)? I'll potentially be scanning the entire image, selecting specific
> image tiles, and performing some work on them.
> If DataFrames are well suited for my use case, how does the data source
> affect my performance? I could always just load data into an RDD and convert
> to DataFrame, or I could convert the image into a parquet file and create
> DataFrames directly. Is one way recommended over the other?
> These are a lot of questions, and I'm still trying to ingest and make sense
> of everything. Any feedback would be greatly appreciated.
> ---------------------------------------------------------------------
> To unsubscribe e-mail:
