spark-user mailing list archives

From "Shao, Saisai" <saisai.s...@intel.com>
Subject RE: Sorting Reduced/Grouped Values without Explicit Sorting
Date Mon, 30 Jun 2014 03:17:32 GMT
Yes, the current implementation has that memory limitation. The community has already noticed this
problem, and there is a patch to address it (PR931<https://github.com/apache/spark/pull/931>);
see the pull request for details.

Also, as you said, Spark currently cannot guarantee the order of elements within partitions
after a shuffle, so you have to sort them yourself.
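In concrete terms, the explicit-sort approach is to group the values by key and then sort each key's value list in memory. Below is a plain-Python sketch of that logic on the sample data from this thread (not actual Spark code; in Spark this would roughly be a groupByKey followed by a mapValues that sorts each list):

```python
# Plain-Python simulation of "group by key, then sort each group explicitly".
from collections import defaultdict

# (name, (time, value)) records from the original question.
records = [("x", (2, 9)), ("x", (1, 3)), ("x", (3, 6)),
           ("y", (2, 5)), ("y", (1, 7)), ("y", (3, 1)),
           ("z", (3, 7)), ("z", (4, 0)), ("z", (1, 4)), ("z", (2, 8))]

# Group values by name (what groupByKey does in Spark).
grouped = defaultdict(list)
for name, pair in records:
    grouped[name].append(pair)

# The explicit sort step: order each name's (time, value) pairs by time.
result = {name: sorted(pairs) for name, pairs in grouped.items()}
print(result["x"])  # [(1, 3), (2, 9), (3, 6)]
```

This is exactly the in-memory sort the thread discusses: it works as long as each key's value list fits in one worker's RAM.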

Thanks
Saisai.

From: Parsian, Mahmoud [mailto:mparsian@illumina.com]
Sent: Monday, June 30, 2014 11:08 AM
To: user@spark.apache.org
Subject: RE: Sorting Reduced/Groupd Values without Explicit Sorting

Hi Jerry,

Thank you for replying to my question. If Spark indeed does not provide a framework-level
"secondary sort", then that is a limitation: there may be cases where a key has more values
than can be sorted in a commodity server's RAM.

If we had a partitioner on the natural key (the name) that preserved the order of the RDD,
that would be a viable solution. For example, if we sort by (name, time), we would get:

(x,1),(1,3)
(x,2),(2,9)
(x,3),(3,6)
(y,1),(1,7)
(y,2),(2,5)
(y,3),(3,1)
(z,1),(1,4)
(z,2),(2,8)
(z,3),(3,7)
(z,4),(4,0)

There is a partitioner, but it does not preserve the order of RDD elements.
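The pattern sketched above is the classic secondary sort: sort by the composite key (name, time), but partition on the natural key (name) only, so each name's rows land in a single partition already ordered by time. A plain-Python sketch of the idea (not Spark code; the number of partitions and the use of Python's built-in hash as the partitioner are illustrative assumptions):

```python
# Sketch of secondary sort: composite-key sort + natural-key partitioning.
records = [("x", 2, 9), ("x", 1, 3), ("x", 3, 6),
           ("y", 2, 5), ("y", 1, 7), ("y", 3, 1),
           ("z", 3, 7), ("z", 4, 0), ("z", 1, 4), ("z", 2, 8)]
num_partitions = 2  # hypothetical partition count

# Sort by the composite key (name, time).
ordered = sorted(records, key=lambda r: (r[0], r[1]))

# Partition by the natural key (name) only, preserving the sorted order
# within each partition as rows are appended.
partitions = [[] for _ in range(num_partitions)]
for name, time, value in ordered:
    partitions[hash(name) % num_partitions].append((name, time, value))

# Every name now lives in exactly one partition, with its rows in time order.
```

Whether a real shuffle preserves this order is exactly the point under discussion in this thread; the sketch only shows what the desired partitioner behavior would look like.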

Thanks again,
best,
Mahmoud

________________________________
From: Shao, Saisai [saisai.shao@intel.com]
Sent: Sunday, June 29, 2014 6:41 PM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: Sorting Reduced/Groupd Values without Explicit Sorting
Hi Mahmoud,

I think you cannot achieve this in the current Spark framework, because Spark's shuffle is
hash-based, unlike MapReduce's sort-based shuffle; you therefore have to implement the
sorting explicitly with RDD operators.

Thanks
Jerry

From: Parsian, Mahmoud [mailto:mparsian@illumina.com]
Sent: Monday, June 30, 2014 9:00 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Sorting Reduced/Groupd Values without Explicit Sorting

Given the following time series data:

name, time, value
x,2,9
x,1,3
x,3,6
y,2,5
y,1,7
y,3,1
z,3,7
z,4,0
z,1,4
z,2,8

we want to generate the following, where the reduced/grouped values are sorted by time:

x => [(1,3), (2,9), (3,6)]
y => [(1,7), (2,5), (3,1)]
z => [(1,4), (2,8), (3,7), (4,0)]

One obvious way to sort the values by time is to use Java's collections sort (an in-memory sort).

How can we get the values sorted by time WITHOUT explicit sorting, i.e., by letting the
Spark framework itself do the sorting?

In Java/MapReduce/Hadoop, we can sort reducer values without explicit sorting:
        job.setPartitionerClass(MyPartitioner.class);
        job.setGroupingComparatorClass(MyGroupingComparator.class);
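For context, what those two Hadoop settings accomplish is: the shuffle sorts records by the composite key (name, time), and the grouping comparator then groups reducer input by the natural key (name) only, so each reducer receives its values already ordered by time. A plain-Python sketch of that mechanism (not Hadoop code; itertools.groupby stands in for the grouping comparator):

```python
# Plain-Python sketch of Hadoop-style secondary sort:
# shuffle sort on the composite key, then group on the natural key only.
from itertools import groupby

records = [("x", 2, 9), ("x", 1, 3), ("x", 3, 6),
           ("y", 2, 5), ("y", 1, 7), ("y", 3, 1),
           ("z", 3, 7), ("z", 4, 0), ("z", 1, 4), ("z", 2, 8)]

# Shuffle sort: order by the composite key (name, time).
ordered = sorted(records, key=lambda r: (r[0], r[1]))

# Grouping comparator: group by the natural key (name) only, so the
# (time, value) pairs within each group are already in time order.
reduced = {name: [(t, v) for _, t, v in rows]
           for name, rows in groupby(ordered, key=lambda r: r[0])}
print(reduced["z"])  # [(1, 4), (2, 8), (3, 7), (4, 0)]
```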

The question is how to sort grouped/reduced values without explicit sorting?

Thanks,
best,
Mahmoud
