spark-user mailing list archives

From yaochunnan
Subject How can I do pair-wise computation between RDD feature columns?
Date Sun, 17 May 2015 01:13:04 GMT
Hi all, 
I've recently run into a scenario where I need to run two-sample tests
between all pairs of columns of an RDD. But the network load and the cost of
generating the pairwise computations make it too time-consuming, and this
has puzzled me for a long time. I want to run the Wilcoxon rank-sum test on
each pair of columns and get the top k most similar pairs.

To be more concrete, I want:
input: original = RDD[Array[Double](3000)]
output: a matrix M of size 3000x3000, where M{i}{j} equals the result of a
certain statistical test between columns i and j of the RDD.

I've read the source code of the Pearson and Spearman correlations in MLlib
Statistics, as well as the implementation of the DIMSUM algorithm in
RowMatrix.scala, because they all perform pairwise computation between
columns in parallel. However, it seems those methods work well in Spark
because they only use column-summary info (such as the sum of all elements
in an RDD[Double]) and information within the same array. To be explicit,
they all look similar to the following:
input: original = RDD[Array[Double](3000)]
step1: summary = original.aggregate
step2: summary_br = sc.broadcast(summary)
step3: result = original.map { i => val summary_v = summary_br.value; some
computation on i }.aggregate
output: result: a matrix of 3000x3000
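
To make the pattern above concrete, here is a minimal sketch in plain Python
(no Spark) of a Pearson correlation matrix built from per-record summaries
only. It is an illustration under my own assumptions, not MLlib's actual code:
the names column_summaries and pearson_matrix are hypothetical, and the single
loop over rows stands in for the aggregate step.

```python
from math import sqrt

def column_summaries(rows):
    """One pass over the records: count, column sums, and cross-products.

    Each record contributes independently, so this has the same shape as a
    single aggregate over the RDD (hypothetical stand-in, not Spark code).
    """
    d = len(rows[0])
    n = 0
    sums = [0.0] * d
    cross = [[0.0] * d for _ in range(d)]
    for row in rows:
        n += 1
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                # Uses only values from the same record (the same array).
                cross[i][j] += row[i] * row[j]
    return n, sums, cross

def pearson_matrix(rows):
    """d x d Pearson correlation matrix computed from the summaries alone."""
    n, sums, cross = column_summaries(rows)
    d = len(sums)
    # Centered cross-products: sum(x_i * x_j) - sum(x_i) * sum(x_j) / n
    cov = [[cross[i][j] - sums[i] * sums[j] / n for j in range(d)]
           for i in range(d)]
    return [[cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(d)]
            for i in range(d)]

# Three records with three columns each.
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 1.0], [3.0, 6.0, 2.0]]
m = pearson_matrix(rows)
```

No step here compares a value in one record with a value in another record,
which is why a single pass plus a small summary is enough.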

They do not require any exchange of information between different records in
the RDD. The Wilcoxon test, however, requires co-ranking the values of each
pair of columns. It seems I have to run the pairwise computations one by one
on the RDD columns. That means at least (n^2-n)/2 jobs, which is about
4.5 million (4,498,500) when n=3000. That is not acceptable.
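
For reference, here is a minimal plain-Python sketch (no Spark) of what
co-ranking means for a single pair of columns, plus the pair count from the
formula above. The function names rank_sum and pair_count are hypothetical,
and this only computes the rank-sum statistic with mid-ranks for ties, not a
p-value.

```python
def rank_sum(x, y):
    """Sum of the ranks of x's values in the pooled sample of x and y.

    Ties receive the average (mid) rank. The pooled sort is the co-ranking
    step: it needs every value of both columns at once.
    """
    # Tag each value with its origin: 0 for x, 1 for y.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    total = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # 1-based ranks i+1 .. j share their average.
        mid_rank = (i + 1 + j) / 2.0
        total += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    return total

def pair_count(n):
    """Number of unordered column pairs, (n^2 - n) / 2."""
    return (n * n - n) // 2

w = rank_sum([1.0, 3.0, 5.0], [2.0, 4.0, 6.0])  # ranks 1, 3, 5
pairs = pair_count(3000)
```

Unlike a column-summary approach, the sort inside rank_sum has to see all
values of both columns together, so it cannot be reduced to a per-record
computation plus a broadcast summary.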

Does anyone have a better idea? This is really torturing me because I have a
related project on hand!
