spark-user mailing list archives

From Guanhua Yan <>
Subject Confused by groupByKey() and the default partitioner
Date Sat, 12 Jul 2014 19:20:09 GMT

I have trouble understanding the default partitioner (hash) in Spark.
Suppose that an RDD with two partitions is created as follows:
x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)
Does Spark partition x based on the hash of the key (e.g., "a", "b", "c") by default?
(1) Assuming this is correct, if I then call the groupByKey primitive,
x.groupByKey(), all records sharing the same key should already be located in
the same partition. In that case no shuffle of the data records should be
needed, as all the grouping operations can be done locally.
(2) If this is not the case, how can I specify a partitioner based simply on
the hash of the key (in Python)?
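For reference, the assignment rule I have in mind can be sketched in plain
Python (the modulo rule and the helper name are my own assumption of how a
hash partitioner would assign records, not Spark's actual implementation):

```python
# Sketch of key-hash partitioning (assumed rule, not Spark internals):
# each (key, value) pair goes to partition hash(key) % num_partitions,
# so all records with the same key land in the same partition.
def hash_partition(records, num_partitions):
    """Group (key, value) pairs into partitions by hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

data = [("a", 1), ("b", 4), ("a", 10), ("c", 7)]
parts = hash_partition(data, 2)
# Both ("a", 1) and ("a", 10) end up in the same partition.
```

Under this rule, grouping by key would indeed be a purely local operation,
which is why I expected no shuffle after groupByKey().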
Thank you,
- Guanhua
