spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Victor Tso-Guillen <v...@paxata.com>
Subject Re: Reproducing the function of a Hadoop Reducer
Date Fri, 19 Sep 2014 21:02:06 GMT
It might not be the same as a real hadoop reducer, but I think it would
accomplish the same. Take a look at:

import org.apache.spark.SparkContext._
// val rdd: RDD[(K, V)]
// def zero(value: V): S
// def reduce(agg: S, value: V): S
// def merge(agg1: S, agg2: S): S
val reducedUnsorted: RDD[(K, S)] = rdd.combineByKey[Int](zero, reduce,
merge)
reducedUnsorted.sortByKey()

On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis <lordjoe2000@gmail.com> wrote:

>  I am struggling to reproduce the functionality of a Hadoop reducer on
> Spark (in Java)
>
> in Hadoop I have a function
> public void doReduce(K key, Iterator<V> values)
> in Hadoop there is also a consumer (context write) which can be seen as
> consume(key,value)
>
> In my code
> 1) knowing the key is important to the function
> 2) there is neither one output tuple2 per key nor one output tuple2 per
> value
> 3) the number of values per key might be large enough that storing them in
> memory is impractical
> 4) keys must appear in sorted order
>
> one good example would run through a large document using a similarity
> function to look at the last 200 lines and output any of those with a
> similarity of more than 0.3 (do not suggest output all and filter - the
> real problem is more complex) the critical concern is an uncertain number
> of tuples per key.
>
> my questions
> 1) how can this be done - ideally a consumer would be a JavaPairRDD but I
> don't see how to create one and add items later
>
> 2) how do I handle the entire partition, sort, process (involving calls to
> doReduce) process
>
>
>

Mime
View raw message