spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gil Vernik <>
Subject question related partitions of the DataFrame
Date Sun, 12 Jul 2015 10:05:57 GMT

DataFrame extends RDDApi, that provides RDD like methods.
My question is, does DataFrame is sort of  stand alone RDD with it?s own 
partitions or it depends on the underlying RDD that was used to load the 
data into its partitions? It's written that DataFrame has ability to scale 
from kilobytes of data on a single laptop to petabytes on a large cluster, 
but i don't understand if the partitions of data frame are independent of 
the partitions of the data source that was used to load the data.

So assume theoretically that i used external DataSource API and wrote code 
 that load 1GB of data into single partition. Then I map this DataSource 
to DataFrame and perform some SQL that returns all the records. Will this 
DataFrame also has one partition in memory or Spark somehow will divide 
this DataFrame into various partitions? If so, how it will be divide it 
into partitions? By size? (can someone point me to the code to see some 

View raw message