So sorry about teasing you with the Scala. But the method is there in Java too, I just checked.

On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen <> wrote:
It might not be the same as a real hadoop reducer, but I think it would accomplish the same. Take a look at:

import org.apache.spark.SparkContext._
// val rdd: RDD[(K, V)]
// def zero(value: V): S
// def reduce(agg: S, value: V): S
// def merge(agg1: S, agg2: S): S
val reducedUnsorted: RDD[(K, S)] = rdd.combineByKey[Int](zero, reduce, merge)

On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis <> wrote:
 I am struggling to reproduce the functionality of a Hadoop reducer on Spark (in Java)

in Hadoop I have a function
public void doReduce(K key, Iterator<V> values) 
in Hadoop there is also a consumer (context write) which can be seen as consume(key,value)

In my code 
1) knowing the key is important to the function
2) there is neither one output tuple2 per key nor one output tuple2 per value
3) the number of values per key might be large enough that storing them in memory is impractical 
4) keys must appear in sorted order

one good example would run through a large document using a similarity function to look at the last 200 lines and output any of those with a similarity of more than 0.3 (do not suggest output all and filter - the real problem is more complex) the critical concern is an uncertain number of tuples per key.

my questions 
1) how can this be done - ideally a consumer would be a JavaPairRDD but I don't see how to create one and add items later

2) how do I handle the entire partition, sort, process (involving calls to doReduce) process