spark-user mailing list archives

From "Mendelson, Assaf" <Assaf.Mendel...@rsa.com>
Subject RE: Use a specific partition of dataframe
Date Thu, 03 Nov 2016 07:33:29 GMT
There are a couple of tools you can use. Take a look at the DataFrame functions.
Specifically, limit might be useful for you, and the sample/sampleBy functions can make your
data smaller.
Actually, when building the DataFrame with createDataFrame you can sample the underlying data
to begin with.
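For instance, a minimal sketch of limit/sample/sampleBy (the DataFrame here is illustrative, and a SparkSession named spark is assumed):

```scala
// Assumes an existing SparkSession `spark`; the data below is made up.
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 1), ("b", 2)).toDF("var", "value")

// Take at most 2 rows instead of the full dataset.
val small = df.limit(2)

// Keep roughly 50% of the rows, sampled without replacement.
val sampled = df.sample(withReplacement = false, fraction = 0.5, seed = 42L)

// Stratified sampling: a different fraction for each value of "var".
val stratified = df.stat.sampleBy("var", Map("a" -> 0.5, "b" -> 0.25), seed = 42L)
```

Note that sample/sampleBy give approximate fractions, not exact row counts.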

Specifically, working by partitions can be done by going through the RDD interface, but I am
not sure this is what you want. Working on a specific partition might mean seeing skewed data
because of the hashing method used to partition (this would of course depend on how your
DataFrame was created).
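If you do want only the first partition, a sketch via the RDD interface (assuming the DataFrame tmp from your example and a SparkSession named spark) could look like this:

```scala
import org.apache.spark.sql.Row

// Keep only the rows that live in partition 0; every other partition
// returns an empty iterator immediately, so little work happens there
// (though each partition is still visited).
val firstPartition = tmp.rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter else Iterator.empty[Row]
}

// Back to a DataFrame to run the groupBy from your example.
val firstDF = spark.createDataFrame(firstPartition, tmp.schema)
firstDF.groupBy("var").count().show()
```

Keep in mind that partition 0 is not a random sample of the data, which is why I would prefer sample/sampleBy.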

Just to get smaller data, sample/sampleBy seems like the best solution to me.

Assaf.

From: Yanwei Zhang [mailto:actuary_zhang@hotmail.com]
Sent: Wednesday, November 02, 2016 6:29 PM
To: user
Subject: Use a specific partition of dataframe

Is it possible to retrieve a specific partition  (e.g., the first partition) of a DataFrame
and apply some function there? My data is too large, and I just want to get some approximate
measures using the first few partitions in the data. I'll illustrate what I want to accomplish
using the example below:

// create data
val tmp = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 1),
                             ("b", 2), ("a", 1), ("b", 2)), 2).toDF("var", "value")
// I want to get the first partition only, and do some calculation,
// for example, count by the value of "var"
val tmp1 = tmp.getPartition(0)  // getPartition does not exist; this is the API I am looking for
tmp1.groupBy("var").count()

The idea is to avoid going through all the data, to save computation time. So I am not sure
whether mapPartitionsWithIndex is helpful in this case, since it still maps over all the data.

Regards,
Wayne


