spark-user mailing list archives

From Corey Nolet <cjno...@gmail.com>
Subject Re: Partition + equivalent of MapReduce multiple outputs
Date Wed, 28 Jan 2015 14:07:50 GMT
In all of the solutions I've found thus far, sorting has been done by
converting the partition iterator into an array and sorting the array. This
is not going to work in my case, as the data in each partition may not
fit into memory. Any ideas?
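
One possible direction, sketched here under assumptions rather than taken
from the thread: Spark 1.2 and later provide
repartitionAndSortWithinPartitions on key-value RDDs, which sorts records as
part of the shuffle instead of materializing a partition as an array. The
composite (group, sortKey) key, the GroupPartitioner class, and the sample
data below are all hypothetical names chosen for illustration.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

// Hypothetical partitioner: routes a (group, sortKey) pair by its group only,
// so every record of one group lands in the same partition.
class GroupPartitioner(groups: Seq[String]) extends Partitioner {
  private val indexOf: Map[String, Int] = groups.zipWithIndex.toMap

  override def numPartitions: Int = groups.size

  override def getPartition(key: Any): Int = key match {
    case (group: String, _) => indexOf(group)
    case other =>
      throw new IllegalArgumentException(s"Unexpected key: $other")
  }
}

object SortWithinPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Assumed input shape: ((group, sortKey), value).
    val groups = Seq("a", "b")
    val records = sc.parallelize(Seq(
      (("b", "k2"), "v3"),
      (("a", "k9"), "v1"),
      (("a", "k1"), "v2")))

    // Partition by group and sort by the full (group, sortKey) tuple
    // during the shuffle; no per-partition array is built by user code.
    val sorted =
      records.repartitionAndSortWithinPartitions(new GroupPartitioner(groups))

    // Each partition iterator yields its records in key order.
    sorted.foreachPartition(iter => iter.foreach(println))

    sc.stop()
  }
}
```

The design choice is to let the shuffle do the sorting, since the shuffle
can spill to disk, rather than sorting inside mapPartitions. Whether this
performs acceptably for a given data size would need to be measured.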

On Wed, Jan 28, 2015 at 1:29 AM, Corey Nolet <cjnolet@gmail.com> wrote:

> I wanted to update this thread for others who may be looking for a
> solution to this as well. I found [1] and I'm going to investigate whether
> it is a viable solution.
>
> [1]
> http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
>
> On Wed, Jan 28, 2015 at 12:51 AM, Corey Nolet <cjnolet@gmail.com> wrote:
>
>> I need to be able to take an input RDD[Map[String,Any]] and split it into
>> several different RDDs based on some partitionable piece of the key
>> (groups) and then send each partition to a separate set of files in
>> different folders in HDFS.
>>
>> 1) Would running the RDD through a custom partitioner be the best way to
>> go about this or should I split the RDD into different RDDs and call
>> saveAsHadoopFile() on each?
>> 2) I need the resulting partitions sorted by key; they also need to be
>> written to the underlying files in sorted order.
>> 3) The number of keys in each partition will almost always be too big to
>> fit into memory.
>>
>> Thanks.
>>
>
>
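
The "separate set of files in different folders" requirement from the
original message can be sketched with Hadoop's MultipleTextOutputFormat
passed to saveAsHadoopFile(). This is a commonly used pattern, not something
confirmed by the thread; the KeyedFolderOutputFormat class name, the sample
records, and the /tmp/multi-out path are assumptions for illustration.

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

// Hypothetical output format: the record key picks the sub-folder,
// and the key itself is dropped from the written line.
class KeyedFolderOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Returning NullWritable makes the text writer emit only the value.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // `name` is the default part file name (for example part-00000);
  // prefixing it with the key places the file under <output>/<key>/.
  override def generateFileNameForKeyValue(
      key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

object MultipleOutputsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("multi-out").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Assumed input shape: (group, line to write).
    val byGroup = sc.parallelize(Seq(
      ("groupA", "row1"),
      ("groupB", "row2"),
      ("groupA", "row3")))

    // One job writes every group; files land in per-group folders.
    byGroup.saveAsHadoopFile(
      "/tmp/multi-out", // hypothetical output directory
      classOf[String],
      classOf[String],
      classOf[KeyedFolderOutputFormat])

    sc.stop()
  }
}
```

Records should be written in the order each partition's iterator yields
them, so if the RDD is partitioned and sorted before this call, the files
would be expected to come out in sorted order as well; that assumption is
worth verifying on real data.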
