spark-user mailing list archives

From: "Shao, Saisai" <saisai.s...@intel.com>
Subject: RE: Sorting Reduced/Groupd Values without Explicit Sorting
Date: Mon, 30 Jun 2014 01:41:45 GMT

Hi Mahmoud,

I don't think you can achieve this in the current Spark framework. Spark's shuffle is currently
hash-based, unlike MapReduce's sort-based shuffle, so you will need to implement the sorting
explicitly with RDD operators.

Thanks
Jerry

From: Parsian, Mahmoud [mailto:mparsian@illumina.com]
Sent: Monday, June 30, 2014 9:00 AM
To: user@spark.apache.org
Subject: Sorting Reduced/Groupd Values without Explicit Sorting

Given the following time series data:

name, time, value
x,2,9
x,1,3
x,3,6
y,2,5
y,1,7
y,3,1
z,3,7
z,4,0
z,1,4
z,2,8

we want to generate the following (the reduced/grouped values are sorted by time).

x => [(1,3), (2,9), (3,6)]
y => [(1,7), (2,5), (3,1)]
z => [(1,4), (2,8), (3,7), (4,0)]

One obvious way to sort the values by time is to use Java's collection sort (sorting in memory).
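
As a minimal sketch of that in-memory approach (plain Java with no Spark dependency; the class
and method names here are made up for illustration), the rows can be grouped by name and each
group then sorted by time with the standard collection sort:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Illustrative only: the class and method names are assumptions, not part of any Spark API.
final class InMemoryGroupSort {

    // Groups rows of {name, time, value} by name, then sorts each group by time.
    static Map<String, List<int[]>> groupAndSort(List<String[]> rows) {
        Map<String, List<int[]>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            int[] timeValue = { Integer.parseInt(row[1]), Integer.parseInt(row[2]) };
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(timeValue);
        }
        // The explicit sort: one in-memory sort per group, ordered by the time field.
        for (List<int[]> values : grouped.values()) {
            values.sort(Comparator.comparingInt((int[] tv) -> tv[0]));
        }
        return grouped;
    }

    // Renders a group in the (time,value) list notation used in this message.
    static String format(List<int[]> values) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (int[] tv : values) {
            joiner.add("(" + tv[0] + "," + tv[1] + ")");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String[] lines = { "x,2,9", "x,1,3", "x,3,6", "y,2,5", "y,1,7",
                           "y,3,1", "z,3,7", "z,4,0", "z,1,4", "z,2,8" };
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split(","));
        }
        groupAndSort(rows).forEach(
            (name, values) -> System.out.println(name + " => " + format(values)));
    }
}
```

This only works when every group fits in memory, which is the limitation the question below
is trying to avoid.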

How can we get the values sorted by time WITHOUT explicit sorting in Spark (that is, by relying
on the Spark framework itself)?

In Java/MapReduce/Hadoop, we can sort reducer values without explicit sorting:
        job.setPartitionerClass(MyPartitioner.class);
        job.setGroupingComparatorClass(MyGroupingComparator.class);
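
The idea behind that MapReduce setup can be sketched in plain Java, without any Hadoop or Spark
classes. This is only an illustration under assumptions: the Record type and method names are
hypothetical, and a single list sort stands in for the framework's sort-based shuffle. Records
are ordered once by the composite (name, time) key, and a single pass then groups them by name
alone, so each group's values arrive already in time order:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics the secondary-sort idea; no Hadoop or Spark API is used.
final class SecondarySortSketch {

    // Hypothetical record carrying the natural key (name) and the ordering field (time).
    static final class Record {
        final String name;
        final int time;
        final int value;

        Record(String name, int time, int value) {
            this.name = name;
            this.time = time;
            this.value = value;
        }
    }

    // Sorts once by (name, time), then groups by name in a single pass.
    static Map<String, List<int[]>> sortThenGroup(List<Record> records) {
        List<Record> sorted = new ArrayList<>(records);
        // Stand-in for the shuffle sort on the composite key.
        sorted.sort(Comparator.comparing((Record r) -> r.name)
                              .thenComparingInt(r -> r.time));
        // Stand-in for the grouping comparator: only the name decides the group.
        Map<String, List<int[]>> groups = new LinkedHashMap<>();
        for (Record r : sorted) {
            groups.computeIfAbsent(r.name, k -> new ArrayList<>())
                  .add(new int[] { r.time, r.value });
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<>();
        records.add(new Record("z", 3, 7));
        records.add(new Record("x", 2, 9));
        records.add(new Record("z", 1, 4));
        records.add(new Record("x", 1, 3));
        sortThenGroup(records).forEach((name, values) -> {
            StringBuilder sb = new StringBuilder(name + " =>");
            for (int[] tv : values) {
                sb.append(" (").append(tv[0]).append(",").append(tv[1]).append(")");
            }
            System.out.println(sb);
        });
    }
}
```

No per-group sort appears in the grouping step; the ordering comes entirely from the one sort
on the composite key.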

The question is how to sort grouped/reduced values without explicit sorting?

Thanks,
best,
Mahmoud