Is it possible to retrieve a specific partition (e.g., the first partition) of a DataFrame and apply some function there? My data is too large, and I just want to get some approximate measures using the first few partitions in the data. I'll illustrate
what I want to accomplish using the example below:
// create date
val tmp = sc.parallelize(Seq( ("a", 1), ("b", 2), ("a", 1),
("b", 2), ("a", 1), ("b", 2)), 2).toDF("var", "value")
// I want to get the first partition only, and do some calculation, for example, count by the value of "var"
tmp1 = tmp.getPartition(0)
The idea is not to go through all the data to save computational
time. So I am not sure whether mapPartitionsWithIndex is helpful in this case, since it still maps all data.