spark-issues mailing list archives

From "Imran Rashid (JIRA)" <>
Subject [jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
Date Sat, 20 Dec 2014 20:03:13 GMT


Imran Rashid commented on SPARK-3655:

Hey Koert,

Good questions about the types; I hadn't really thought about them yet.  I guess I'm actually
proposing 3 type parameters -- the row type doesn't change at all, but there are additional
types for the partitioning and sorting.

val x: RDD[X] = ...
val y: SortedRDD[X,K,V] = x.groupAndSort(f1, f2)

so then you'd have

mapPartitions[Y](f: Iterator[X] => Iterator[Y]): RDD[Y]

mapGroup[Y](f: (K, Iterator[X]) => Iterator[Y]): RDD[Y]

foldByKey[Y](zero:Y)(f: (Y, X) => Y): RDD[Y]

or maybe the return type of mapGroup & foldByKey would be RDD[(K,Seq[Y])] or something
... or there is another variant which would let you return another SortedRDD.  probably need
to try out some variants and see how they look.
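
For illustration only, here is a minimal sketch of what mapGroup and foldByKey could do inside a single partition. It assumes plain Scala iterators rather than RDDs, and that the rows arrive already sorted so that rows sharing a key are contiguous. The object name GroupOps and the explicit key parameter are hypothetical and not part of the proposal; foldByKey follows the signature above and returns one folded value per group.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helpers, not Spark API: they mimic the proposed operations
// on one partition whose rows are already grouped (equal keys are adjacent).
object GroupOps {

  // Calls f once per run of consecutive rows that share a key.
  // For simplicity each group is buffered in memory before f sees it.
  def mapGroup[X, K, Y](rows: Iterator[X])(key: X => K)(
      f: (K, Iterator[X]) => Iterator[Y]): Iterator[Y] = {
    val buffered = rows.buffered
    val perGroup = new Iterator[Iterator[Y]] {
      def hasNext: Boolean = buffered.hasNext
      def next(): Iterator[Y] = {
        val k = key(buffered.head)
        val group = ArrayBuffer.empty[X]
        while (buffered.hasNext && key(buffered.head) == k) group += buffered.next()
        f(k, group.iterator)
      }
    }
    perGroup.flatMap(it => it)
  }

  // Folds each group down to a single value, starting from zero.
  def foldByKey[X, K, Y](rows: Iterator[X])(key: X => K)(zero: Y)(
      f: (Y, X) => Y): Iterator[Y] =
    mapGroup(rows)(key)((_, xs) => Iterator.single(xs.foldLeft(zero)(f)))
}
```

A variant that returns (K, Y) pairs instead would only change the last line, which is one of the return-type choices still open above.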

Having three type parameters is a little unwieldy ... maybe we don't even bother keeping the
types K & V if they don't actually get us anything.  E.g. I don't think you actually need
to expose the type V at all.  You really just need to keep an Ordering[X] as a member variable.
Then groupAndSort takes an X => V and constructs an Ordering[X] out of it.
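
A minimal sketch of that idea, using plain Scala collections as a stand-in for RDDs. The class name GroupSorted, the groups method, and the Event example type are all hypothetical; the only point is that V appears in the signature of groupAndSort and nowhere in the resulting type.

```scala
// Hypothetical stand-in for the proposed sorted RDD: it stores an Ordering[X],
// so the sort-key type V never shows up in its type parameters.
final class GroupSorted[X, K](rows: Seq[X], groupKey: X => K, ordering: Ordering[X]) {
  // Rows of each group, sorted with the stored ordering.
  def groups: Map[K, Seq[X]] =
    rows.groupBy(groupKey).map { case (k, xs) => k -> xs.sorted(ordering) }
}

object GroupSorted {
  // V is needed only here, to turn an Ordering[V] into an Ordering[X].
  def groupAndSort[X, K, V](rows: Seq[X])(groupKey: X => K, sortKey: X => V)(
      implicit ord: Ordering[V]): GroupSorted[X, K] =
    new GroupSorted(rows, groupKey, Ordering.by(sortKey))
}

// Example row type, purely for illustration.
case class Event(user: String, ts: Long)
```

With this, GroupSorted.groupAndSort(events)(_.user, _.ts) has type GroupSorted[Event, String], and the Long used for sorting is not visible to callers.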

yeah I dunno about name either ... PartitionSortedRdd?  GroupSortedRdd? ...

Glad you are interested in this and think an implementation would be easy.  I was actually
going to suggest that maybe I'm proposing a bigger change, so it should come after the existing
work you've done.  Especially since I'm really proposing adding some new apis for even basic
partitioning & grouping, even without involving secondary sort at all ...

> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>                 Key: SPARK-3655
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>            Priority: Minor
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are
> some use cases where getting a sorted iterator of values per key is helpful.

This message was sent by Atlassian JIRA
