spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Primitive arrays in Spark
Date Tue, 21 Oct 2014 23:07:45 GMT
It seems that ++ does the right thing on arrays of longs, and gives you another one:

scala> val a = Array[Long](1,2,3)
a: Array[Long] = Array(1, 2, 3)

scala> val b = Array[Long](1,2,3)
b: Array[Long] = Array(1, 2, 3)

scala> a ++ b
res0: Array[Long] = Array(1, 2, 3, 1, 2, 3)

scala> res0.getClass
res1: Class[_ <: Array[Long]] = class [J

The problem might be that lots of intermediate space is allocated as you merge values two
by two. In particular, if a key has N arrays mapping to it, your code will allocate O(N^2)
space because it builds first an array of size 1, then 2, then 3, etc. You can make this faster
by using aggregateByKey instead, and using an intermediate data structure other than an Array
to do the merging (ideally you'd find a growable ArrayBuffer-like class specialized for Longs,
but you can also just try ArrayBuffer).
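
Something like this (a rough, untested sketch, using the rdd1 from your original message; note that ArrayBuffer[Long] still boxes its elements, which is why a Long-specialized buffer would be even better):

import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

val rdd2: RDD[(Long, Array[Long])] = rdd1
  .aggregateByKey(new ArrayBuffer[Long]())(
    (buf, arr) => buf ++= arr,      // seqOp: append each incoming Array[Long] to the buffer
    (buf1, buf2) => buf1 ++= buf2)  // combOp: merge partial buffers from different partitions
  .mapValues(_.toArray)             // one final copy back to a primitive Array[Long]

This way each value is copied into a growing buffer once (plus amortized resizing) rather than re-copied on every pairwise merge, and you only pay for one toArray per key at the end.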

Matei



> On Oct 21, 2014, at 1:08 PM, Akshat Aranya <aaranya@gmail.com> wrote:
> 
> This is as much of a Scala question as a Spark question.
> 
> I have an RDD:
> 
> val rdd1: RDD[(Long, Array[Long])]
> 
> This RDD has duplicate keys that I can collapse like so:
> 
> val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => a ++ b)
> 
> If I start with an Array of primitive longs in rdd1, will rdd2 also have Arrays of primitive
> longs?  I suspect, based on my memory usage, that this is not the case.
> 
> Also, would it be more efficient to do this:
> 
> val rdd1: RDD[(Long, ArrayBuffer[Long])]
> 
> and then
> 
> val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => a ++ b).mapValues(_.toArray)
> 



