spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gil Vernik <G...@il.ibm.com>
Subject Re: question related partitions of the DataFrame
Date Tue, 14 Jul 2015 08:59:22 GMT
I see that most recent code doesn't has RDDApi anymore.
But i still would like to understand the logic of partitions of DataFrame. 
Does DataFrame has it's own partitions and is sort of RDD by itself, or it 
depends on the partitions of the underline RDD that was used to load the 
data?

For example, if i create DataFrame from HadoopRDD - does it means that 
DataFrame has the same partitions as HadoopRDD?

Thanks
Gil.




From:   Gil Vernik/Haifa/IBM@IBMIL
To:     Dev <dev@spark.apache.org>
Date:   12/07/2015 13:06
Subject:        question related partitions of the DataFrame



Hi, 

DataFrame extends RDDApi, that provides RDD like methods. 
My question is, does DataFrame is sort of  stand alone RDD with it?s own 
partitions or it depends on the underlying RDD that was used to load the 
data into its partitions? It's written that DataFrame has ability to scale 
from kilobytes of data on a single laptop to petabytes on a large cluster, 
but i don't understand if the partitions of data frame are independent of 
the partitions of the data source that was used to load the data. 

So assume theoretically that i used external DataSource API and wrote code 
 that load 1GB of data into single partition. Then I map this DataSource 
to DataFrame and perform some SQL that returns all the records. Will this 
DataFrame also has one partition in memory or Spark somehow will divide 
this DataFrame into various partitions? If so, how it will be divide it 
into partitions? By size? (can someone point me to the code to see some 
example)? 


Thanks, 
Gil.


Mime
View raw message