spark-user mailing list archives

From jtgenesis <jtgene...@gmail.com>
Subject DataFrame use case
Date Tue, 16 Aug 2016 17:32:50 GMT
Hey guys, I've been digging around trying to figure out if I should
transition from RDDs to DataFrames. I'm currently using RDDs to represent
tiles of binary imagery data and I'm wondering if representing the data as a
DataFrame is a better solution.

To get my feet wet, I ran a small Word Count comparison on a 1GB file of
random text, once with an RDD and once with a DataFrame, and got the
following results:

RDD Count total: 137733312 Time Elapsed: 44.5675378 s
DataFrame Count total: 137733312 Time Elapsed: 69.201253448 s
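
For reference, a comparison along these lines can be sketched in PySpark as
below. This is only a minimal illustration, not the exact code behind the
numbers above: the function names, the whitespace tokenization, and the
timing helper are all assumptions, and `sc`, `spark`, and `path` stand for an
existing SparkContext, an existing SparkSession, and the input file.

```python
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def rdd_word_count(sc, path):
    # RDD version (assumed): split each line on whitespace, count all tokens.
    return sc.textFile(path).flatMap(lambda line: line.split()).count()


def df_word_count(spark, path):
    # DataFrame version (assumed): spark.read.text yields one row per line in
    # a column named "value"; explode turns each line into one row per token.
    # The import is local so the timing helper works without Spark installed.
    from pyspark.sql import functions as F

    lines = spark.read.text(path)
    words = lines.select(
        F.explode(F.split(F.col("value"), r"\s+")).alias("word")
    )
    # Drop empty tokens produced by leading/trailing whitespace so the total
    # matches str.split() in the RDD version.
    return words.filter(F.col("word") != "").count()
```

With a live session, `timed(lambda: rdd_word_count(sc, path))` and
`timed(lambda: df_word_count(spark, path))` would give the two measurements.
Since `count()` is an action, each call forces the full read and split.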

I figured the DataFrame would outperform the RDD, since I've seen many
sources that report superior speeds with DataFrames. These results could just
be due to an implementation issue, the unstructured data, or the data
source. I'm not really sure.

This leads me to take a step back and ask: which applications are better
suited to DataFrames than to RDDs? In my case, the original image file is
unstructured, but the data is loaded into a pairRDD, where the key contains
multiple attributes that pertain to the value. The value is a chunk of the
image represented as an array of bytes. Since my data will be in a
structured format, I don't see why I can't benefit from DataFrames. However,
should I be concerned about any performance issues related to
processing/moving byte arrays (each chunk is a uniform size in the KB-MB
range)? I'll potentially be scanning the entire image, selecting specific
image tiles, and performing some work on them.
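
To make that layout concrete, here is a rough sketch of how such
(key, value) pairs could be flattened into DataFrame rows. The key
attributes (`image_id`, `tile_row`, `tile_col`) and both function names are
hypothetical placeholders, since the actual key fields aren't spelled out
above; `spark` and `pair_rdd` stand for an existing SparkSession and the
existing pairRDD.

```python
def flatten_tile(pair):
    """Turn ((image_id, tile_row, tile_col), data) into one flat row tuple."""
    (image_id, tile_row, tile_col), data = pair
    # bytearray is the Python type that maps onto Spark SQL's BinaryType.
    return (image_id, tile_row, tile_col, bytearray(data))


def tiles_to_dataframe(spark, pair_rdd):
    # Imports are local so flatten_tile can be tried without Spark installed.
    from pyspark.sql.types import (
        BinaryType,
        IntegerType,
        StringType,
        StructField,
        StructType,
    )

    # Hypothetical schema: the key attributes become ordinary columns and
    # the image chunk becomes a single binary column.
    schema = StructType([
        StructField("image_id", StringType(), False),
        StructField("tile_row", IntegerType(), False),
        StructField("tile_col", IntegerType(), False),
        StructField("data", BinaryType(), False),
    ])
    return spark.createDataFrame(pair_rdd.map(flatten_tile), schema)
```

Selecting specific tiles would then be a filter on the key columns, for
example `df.filter((df["tile_row"] == 2) & (df["tile_col"] == 3))`, while
the byte array stays an opaque value in the `data` column.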

If DataFrames are well suited for my use case, how does the data source
affect my performance? I could always just load the data into an RDD and
convert it to a DataFrame, or I could convert the image into a Parquet file
and create DataFrames directly. Is one way recommended over the other?

These are a lot of questions, and I'm still trying to ingest and make sense
of everything. Any feedback would be greatly appreciated.


