spark-user mailing list archives

From Corey Nolet <cjno...@gmail.com>
Subject Partition + equivalent of MapReduce multiple outputs
Date Wed, 28 Jan 2015 05:51:00 GMT
I need to take an input RDD[Map[String,Any]], split it into several
different RDDs based on some partitionable piece of the key (a group),
and then write each partition to its own set of files in a separate
folder in HDFS.

1) Would running the RDD through a custom partitioner be the best way to do
this, or should I split the RDD into separate RDDs and call
saveAsHadoopFile() on each?
2) I need the resulting partitions sorted by key, and they must also be
written to the underlying files in sorted order.
3) The number of keys in each partition will almost always be too large to
fit in memory.
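
To make option 1 concrete, here is a rough sketch of what I have in mind,
not a tested solution. It assumes Spark 1.2+ (for
repartitionAndSortWithinPartitions) and the old mapred
MultipleTextOutputFormat. The record fields "group" and "id", and the names
GroupPartitioner, GroupedOutputFormat and SplitByGroup, are made up for
illustration.

```scala
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._ // pair-RDD implicits; needed on Spark 1.2
import org.apache.spark.rdd.RDD

// Hypothetical partitioner: sends every record of a group to the same partition.
// Keys are assumed to be (group, sortKey) pairs.
class GroupPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val group = key.asInstanceOf[(String, String)]._1
    val h = group.hashCode % partitions
    if (h < 0) h + partitions else h // keep the index non-negative
  }
}

// Hypothetical output format: writes each record under <outputDir>/<group>/part-NNNNN.
class GroupedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"${key.asInstanceOf[(String, String)]._1}/$name"

  // Write only the sort key, not the whole (group, sortKey) tuple.
  override def generateActualKey(key: Any, value: Any): Any =
    key.asInstanceOf[(String, String)]._2
}

object SplitByGroup {
  def save(records: RDD[Map[String, Any]], outputDir: String, partitions: Int): Unit = {
    records
      // "group" and "id" are assumed field names in each record.
      .map(r => ((r("group").toString, r("id").toString), r.toString))
      // Partition by group only; sort by (group, id) inside each partition.
      .repartitionAndSortWithinPartitions(new GroupPartitioner(partitions))
      .saveAsHadoopFile(outputDir, classOf[String], classOf[String],
        classOf[GroupedOutputFormat])
  }
}
```

As far as I understand, the sort here happens as part of the shuffle and
can spill to disk, which would cover point 3, and since each group lands in
exactly one partition, each group's file would be written in key order.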

Thanks.
