spark-user mailing list archives

From Corey Nolet <>
Subject Re: Partition + equivalent of MapReduce multiple outputs
Date Wed, 28 Jan 2015 14:16:53 GMT
I'm looking at the ShuffledRDD code and it looks like there is a method
setKeyOrdering(). Is this guaranteed to order everything in the partition?
I'm on Spark 1.2.0.
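
For reference, one common way to sort data that does not fit in memory is an external merge sort: sort bounded chunks, spill each sorted chunk to disk, then merge the spilled runs lazily. The sketch below is plain Python rather than Spark code, and it makes no claim about what setKeyOrdering() does internally. The names external_sort, _read_run and chunk_size are made up for illustration.

```python
import heapq
import pickle
import tempfile
from itertools import islice


def _read_run(f):
    """Yield the pickled records of one sorted run, one at a time."""
    f.seek(0)
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def external_sort(records, key, chunk_size=100_000):
    """Return an iterator over `records` in key order.

    At most `chunk_size` records are held in memory while building runs,
    and one record per run while merging.
    """
    records = iter(records)
    runs = []
    while True:
        # Sort one bounded chunk in memory.
        chunk = sorted(islice(records, chunk_size), key=key)
        if not chunk:
            break
        # Spill the sorted chunk to a temporary file (a "run").
        run = tempfile.TemporaryFile()
        for rec in chunk:
            pickle.dump(rec, run)
        runs.append(run)
    # Lazily merge the sorted runs; nothing is re-read until iterated.
    return heapq.merge(*(_read_run(f) for f in runs), key=key)
```

The result is an iterator, so a consumer can write records out one at a time without ever turning the whole partition into an array.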

On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet <> wrote:

> In all of the solutions I've found thus far, sorting has been done by
> converting the partition iterator into an array and sorting the array. This
> is not going to work for my case, as the amount of data in each partition may
> not necessarily fit into memory. Any ideas?
> On Wed, Jan 28, 2015 at 1:29 AM, Corey Nolet <> wrote:
>> I wanted to update this thread for others who may be looking for a
>> solution to this as well. I found [1] and I'm going to investigate whether
>> this is a viable solution.
>> [1]
>> On Wed, Jan 28, 2015 at 12:51 AM, Corey Nolet <> wrote:
>>> I need to be able to take an input RDD[Map[String,Any]] and split it
>>> into several different RDDs based on some partitionable piece of the key
>>> (groups) and then send each partition to a separate set of files in
>>> different folders in HDFS.
>>> 1) Would running the RDD through a custom partitioner be the best way to
>>> go about this or should I split the RDD into different RDDs and call
>>> saveAsHadoopFile() on each?
>>> 2) I need the resulting partitions sorted by key; they also need to be
>>> written to the underlying files in sorted order.
>>> 3) The number of keys in each partition will almost always be too big to
>>> fit into memory.
>>> Thanks.
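
The requirements in the original question (split by group, one folder per group, records written in key order) can be sketched with a composite sort key: ordering on (group, key) makes each group contiguous and key-sorted, so the output can be streamed to one file per group. This is plain Python on the local filesystem, standing in for Spark and HDFS. The function name write_groups_sorted, the (group, key, value) record shape and the part-00000 file name are assumptions made for illustration, and the in-memory sorted() call stands in for whatever out-of-core sort is actually used.

```python
import os
from itertools import groupby
from operator import itemgetter


def write_groups_sorted(records, out_dir):
    """Write (group, key, value) records to out_dir/<group>/part-00000.

    Each file holds one group's records in ascending key order.
    Returns the list of file paths written, in group order.
    """
    # Sorting on the composite (group, key) makes every group contiguous
    # and key-ordered. sorted() is in-memory; it is a placeholder for an
    # out-of-core sort when the data does not fit in memory.
    ordered = sorted(records, key=itemgetter(0, 1))
    written = []
    for group, rows in groupby(ordered, key=itemgetter(0)):
        folder = os.path.join(out_dir, str(group))
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, "part-00000")
        with open(path, "w") as f:
            # Stream the group's rows to its file one record at a time.
            for _, key, value in rows:
                f.write(f"{key}\t{value}\n")
        written.append(path)
    return written
```

Because the write loop only ever looks at the current record, the memory cost is determined by the sort step alone.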
