spark-user mailing list archives

From Mark Hamstra <>
Subject Re: RDD and Partition
Date Tue, 28 Jan 2014 20:20:15 GMT
scala> import org.apache.spark.RangePartitioner

scala> val rdd = sc.parallelize(List("apple", "Ball", "cat", "dog",
"Elephant", "fox", "gas", "horse", "index", "jet", "kitsch", "long",
"moon", "Neptune", "ooze", "Pen", "quiet", "rose", "sun", "talk",
"umbrella", "voice", "Walrus", "xeon", "Yam", "zebra"))

scala> rdd.keyBy(s => s(0).toUpper)
res0: org.apache.spark.rdd.RDD[(Char, String)] = MappedRDD[1] at keyBy at

scala> res0.partitionBy(new RangePartitioner[Char, String](26, res0)).values
res2: org.apache.spark.rdd.RDD[String] = MappedRDD[5] at values at

scala> res2.mapPartitionsWithIndex((idx, itr) => itr.map(s => (idx, s))).collect
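The REPL session above relies on RangePartitioner to send each first letter to its own partition. As a rough illustration of the idea (this is a simplified sketch, not Spark's actual RangePartitioner code), a range partitioner keeps a sorted list of upper bounds and assigns a key to the first range whose bound is at least the key; the bound values below are hypothetical:

```scala
// Simplified sketch of range-partition assignment in plain Scala.
// NOT Spark's implementation; the bounds ('I', 'R') are made-up example values
// standing in for the sampled range bounds a real RangePartitioner computes.
object RangePartitionSketch {
  // rangeBounds holds the upper bound of every partition except the last,
  // so rangeBounds of length n-1 describes n partitions.
  def getPartition(key: Char, rangeBounds: Seq[Char]): Int = {
    // First bound that is >= key gives the partition index.
    val idx = rangeBounds.indexWhere(bound => key <= bound)
    // Keys beyond the last bound fall into the final partition.
    if (idx >= 0) idx else rangeBounds.length
  }

  def main(args: Array[String]): Unit = {
    val bounds = Seq('I', 'R')          // 3 partitions: A-I, J-R, S-Z
    println(getPartition('A', bounds))  // 0
    println(getPartition('K', bounds))  // 1
    println(getPartition('Z', bounds))  // 2
  }
}
```

With 26 partitions and single-letter keys, each letter ends up alone in its own range, which is why the `mapPartitionsWithIndex` call above pairs each string with a distinct partition index per letter.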


On Tue, Jan 28, 2014 at 11:48 AM, Nick Pentreath wrote:

> If you do something like:
> rdd.map { str => (str.take(1), str) }
> you will have an RDD[(String, String)] where the key is the first
> character of the string. Now when you perform an operation that uses
> partitioning (e.g. reduceByKey) you will end up with the 1st reduce task
> receiving all the strings with A, the 2nd all the strings with B etc. Note
> that you may not be able to enforce that each *machine* gets a different
> letter, but in most cases that doesn't particularly matter as long as you
> get "all values for a given key go to the same reducer" behaviour.
> Perhaps if you expand on your use case we can provide more detailed
> assistance.
> On Tue, Jan 28, 2014 at 9:35 PM, David Thomas <> wrote:
>> Let's say I have an RDD of Strings and there are 26 machines in the
>> cluster. How can I repartition the RDD in such a way that all strings
>> starting with A get collected on machine1, B on machine2, and so on?
