spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stanley Burnitt <>
Subject ReduceByKey OOME workaround: 'sample and subtract', but rdd.subtract does not subtract.
Date Fri, 11 Oct 2013 23:51:34 GMT
Sampled RDDs contain the number of expected elements, but RDD.subtract(sample) does nothing;
the returned RDDs have the same # of elements as the originalRDD.

Here is the code...

    long count0 = resultRDD.count();

    JavaPairRDD<KeyBytesWritable, AggregationWritable> r1 = resultRDD.sample(false,
0.50, 7);
    JavaPairRDD<KeyBytesWritable, AggregationWritable> r2 = resultRDD.subtract(r1);

    long count1 = r1.count();
    long count2 = r2.count();    /* count0 == count2 ?!  */

    System.out.printf("RESULT-RDD.COUNT=%d ... COUNT1=%d ... COUNT2=%d  \n", count0, count1,

The printf output shows nothing was subtracted from the original rdd.    Am I using the subtract
method incorrectly?

The  goal is to split a very large Pair-RDD containing 1 partition into a Pair-RDD containing
many smaller partitions in hopes that this will fix OOM errors during a reduceByKey task.
After much tweaking of jvm GC parameters there is improvement, but eventually, the job just
stalls out - no OOM, but no progress (jstack and jmap shows workers' new-gen is maxed out
during RDD de-serialization).

Does anyone have any suggestions?


View raw message