spark-user mailing list archives

From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: RDD and Partition
Date Tue, 28 Jan 2014 19:48:03 GMT
If you do something like:

rdd.map{ str => (str.take(1), str) }

you will have an RDD[(String, String)] where the key is the first character
of the string. Now when you perform an operation that uses partitioning
(e.g. reduceByKey), all the strings with the same first letter end up in the
same reduce task: one task receives all the A strings, another all the B
strings, etc. With the default hash partitioner, which task handles which
letter depends on the key's hash, and several letters can share a task. Note
that you may not be able to enforce that each *machine* gets a different
letter, but in most cases that doesn't particularly matter as long as you get
"all values for a given key go to the same reducer" behaviour.
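
If you do want a fixed letter-to-partition layout (A in partition 0, B in
partition 1, and so on), one option is a custom Partitioner. The following is
only a rough sketch: the class name FirstLetterPartitioner and the extra
catch-all partition for non-letter keys are my own assumptions, not something
from this thread, and it still controls partitions rather than machines.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner (name and layout are assumptions for illustration):
// one partition per letter A-Z, plus a catch-all partition for any other key.
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 27

  override def getPartition(key: Any): Int = key match {
    // Case-insensitive: "a" and "A" both map to partition 0.
    case s: String if s.nonEmpty && s.head.toUpper >= 'A' && s.head.toUpper <= 'Z' =>
      s.head.toUpper - 'A'
    // Empty strings, digits, punctuation, non-ASCII letters, non-String keys.
    case _ => 26
  }
}

// Usage, assuming `rdd` is an existing RDD[String]. Older Spark releases also
// need `import org.apache.spark.SparkContext._` for the pair-RDD operations.
// val byLetter = rdd.map(str => (str.take(1), str))
//                   .partitionBy(new FirstLetterPartitioner)
```

Which executor ends up holding each partition is still decided by the
scheduler, so this gives you one letter per partition, not one letter per
machine.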

Perhaps if you expand on your use case we can provide more detailed
assistance.


On Tue, Jan 28, 2014 at 9:35 PM, David Thomas <dt5434884@gmail.com> wrote:

> Let's say I have an RDD of Strings and there are 26 machines in the
> cluster. How can I repartition the RDD in such a way that all strings
> starting with A get collected on machine1, B on machine2 and so on?
>
>
