spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Burak Yavuz <>
Subject Re: spark-sorted, or secondary sort and streaming reduce for spark
Date Fri, 06 Mar 2015 19:40:59 GMT
Hi Koert,

Would you like to register this on


On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers <> wrote:

> currently spark provides many excellent algorithms for operations per key
> as long as the data send to the reducers per key fits in memory. operations
> like combineByKey, reduceByKey and foldByKey rely on pushing the operation
> map-side so that the data reduce-side is small. and groupByKey simply
> requires that the values per key fit in memory.
> but there are algorithms for which we would like to process all the values
> per key reduce-side, even when they do not fit in memory. examples are
> algorithms that need to process the values ordered, or algorithms that need
> to emit all values again. basically this is what the original hadoop reduce
> operation did so well: it allowed sorting of values (using secondary sort),
> and it processed all values per key in a streaming fashion.
> the library spark-sorted aims to bring these kind of operations back to
> spark, by providing a way to process values with a user provided
> Ordering[V] and a user provided streaming operation Iterator[V] =>
> Iterator[W]. it does not make the assumption that the values need to fit in
> memory per key.
> the basic idea is to rely on spark's sort-based shuffle to re-arrange the
> data so that all values for a given key are placed consecutively within a
> single partition, and then process them using a map-like operation.
> you can find the project here:
> the project is in a very early stage. any feedback is very much
> appreciated.

View raw message