spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: coalescing RDD into equally sized partitions
Date Tue, 25 Mar 2014 06:11:42 GMT
This happened because the values are all integers equal to 0 mod 5, and the shuffle used the default hashCode
implementation for integers, which maps every one of them to partition 0. There's no API method that will
look at the resulting partition sizes and rebalance them, but you could use another hash function.
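A minimal sketch (not from the original thread) of why this happens: Spark's HashPartitioner assigns a key to `nonNegativeMod(key.hashCode, numPartitions)`, and for `Int` keys `hashCode` is the value itself, so every multiple of 20 lands in partition 0 of 5. Re-keying before partitioning (here by dividing out the common stride of 20, an assumption specific to this example data) is one way to spread the values out:

```scala
// Assumed stand-alone sketch; `partition` mimics Spark's HashPartitioner logic.
object HashBalance {
  // Equivalent of HashPartitioner: nonNegativeMod(key.hashCode, numPartitions)
  def partition(key: Int, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(0, 20, 40, 60, 80)

    // Every multiple of 20 is a multiple of 5, so hashCode % 5 == 0 for all:
    println(data.map(partition(_, 5)))          // List(0, 0, 0, 0, 0)

    // Re-keying by the common stride gives distinct hashes, one per partition:
    println(data.map(x => partition(x / 20, 5))) // List(0, 1, 2, 3, 4)
  }
}
```

In a real job the same idea would be applied by keying the RDD (e.g. via a pair RDD and `partitionBy` with a custom `Partitioner`) rather than by changing `coalesce` itself.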

Matei

On Mar 24, 2014, at 5:20 PM, Walrus theCat <walrusthecat@gmail.com> wrote:

> Hi,
> 
> sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0 ).coalesce(5,true).glom.collect
> 
> yields
> 
> Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), Array(), Array())
> 
> How do I get something more like:
> 
>  Array(Array(0), Array(20), Array(40), Array(60), Array(80))
> 
> Thanks

