If you do something like:

rdd.map{ str => (str.take(1), str) }

you will have an RDD[(String, String)] where the key is the first character of the string. Now when you perform an operation that shuffles by key (e.g. reduceByKey), all the strings starting with A will go to one reduce task, all the strings starting with B to one reduce task, and so on. With the default hash partitioning, which task a given letter goes to depends on the key's hash, and two letters can share a task if their hashes collide modulo the number of partitions. Note also that you may not be able to enforce that each machine gets a different letter, but in most cases that doesn't particularly matter as long as you get the "all values for a given key go to the same reducer" behaviour.
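
If you do want a fixed letter-to-partition mapping (A in partition 0, B in partition 1, etc.), one option is a custom Partitioner passed to partitionBy. The following is only a rough sketch: FirstLetterPartitioner, FirstLetterExample and partitionByFirstLetter are names I made up for illustration, and the 27th catch-all partition for empty strings and non-letters is my own assumption about how you would want those handled.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed on older Spark versions
import org.apache.spark.rdd.RDD

// Hypothetical partitioner: partitions 0-25 hold keys starting with A-Z
// (case-insensitive); partition 26 is a catch-all for everything else.
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 27

  override def getPartition(key: Any): Int = key match {
    case s: String if s.nonEmpty && s.head.toUpper >= 'A' && s.head.toUpper <= 'Z' =>
      s.head.toUpper - 'A'
    case _ => 26 // assumption: empty strings and non-letters share one partition
  }

  // Equal partitioners let Spark recognise that two RDDs are already co-partitioned.
  override def equals(other: Any): Boolean = other.isInstanceOf[FirstLetterPartitioner]
  override def hashCode: Int = numPartitions
}

object FirstLetterExample {
  // Keys each string by its first character, then shuffles it into the
  // partition chosen by FirstLetterPartitioner.
  def partitionByFirstLetter(rdd: RDD[String]): RDD[(String, String)] =
    rdd.map { str => (str.take(1), str) }
       .partitionBy(new FirstLetterPartitioner)
}
```

This pins each letter to a partition, not to a machine. Which machine ends up running a given partition is still decided by the scheduler.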

Perhaps if you expand on your use case we can provide more detailed assistance.


On Tue, Jan 28, 2014 at 9:35 PM, David Thomas <dt5434884@gmail.com> wrote:
Let's say I have an RDD of Strings and there are 26 machines in the cluster. How can I repartition the RDD in such a way that all strings starting with A get collected on machine1, B on machine2, and so on?