spark-user mailing list archives

From Yanwei Zhang <actuary_zh...@hotmail.com>
Subject Use a specific partition of dataframe
Date Wed, 02 Nov 2016 16:28:46 GMT
Is it possible to retrieve a specific partition (e.g., the first partition) of a DataFrame
and apply some function to it? My data is too large, and I just want to compute some approximate
measures using the first few partitions of the data. I'll illustrate what I want to accomplish
using the example below:

// create data
val tmp = sc.parallelize(Seq( ("a", 1), ("b", 2), ("a", 1),
                                  ("b", 2), ("a", 1), ("b", 2)), 2).toDF("var", "value")
// I want to get the first partition only, and do some calculation, for example, count by
// the value of "var". Note that getPartition is not an existing DataFrame method; it is
// only what I would like to be able to write.
val tmp1 = tmp.getPartition(0)
tmp1.groupBy("var").count()

The idea is to avoid going through all the data, in order to save computation time. So I am not
sure whether mapPartitionsWithIndex is helpful in this case, since it still maps over all the data.
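
One possible way to approach this, sketched under assumptions rather than offered as a definitive answer: drop to the underlying RDD, prune it to the wanted partition with PartitionPruningRDD (a developer API in org.apache.spark.rdd), and rebuild a DataFrame with the original schema. The sketch assumes Spark 2.x with a local SparkSession; the object name FirstPartitionSketch and the local[2] master are only illustrative. Tasks are launched only for the partitions that pass the filter, but whatever lineage sits upstream of those partitions (for example a shuffle) still has to be computed.

```scala
import org.apache.spark.rdd.PartitionPruningRDD
import org.apache.spark.sql.SparkSession

// Illustrative object name; not part of any Spark API.
object FirstPartitionSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only to keep the sketch self-contained.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("first-partition-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same toy data as in the question, split into 2 partitions.
    val tmp = spark.sparkContext
      .parallelize(Seq(("a", 1), ("b", 2), ("a", 1),
                       ("b", 2), ("a", 1), ("b", 2)), 2)
      .toDF("var", "value")

    // Keep only partition 0 of the underlying RDD[Row].
    // The filter receives the partition index and returns true for partitions to keep.
    val firstPartRdd = PartitionPruningRDD.create(tmp.rdd, (idx: Int) => idx == 0)

    // Rebuild a DataFrame over the pruned RDD, reusing the original schema.
    val tmp1 = spark.createDataFrame(firstPartRdd, tmp.schema)

    // Count by the value of "var", using rows from the first partition only.
    tmp1.groupBy("var").count().show()

    spark.stop()
  }
}
```

To use the first few partitions instead of just the first, the filter could be changed to something like (idx: Int) => idx < 3. Which rows land in which partition depends on how the data was loaded, so the result is only an approximation unless the partitions are known to be representative.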

Regards,
Wayne


