spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From yaochunnan <yaochun...@gmail.com>
Subject How can I do pair-wise computation between RDD feature columns?
Date Sun, 17 May 2015 01:13:04 GMT
Hi all, 
Recently I've ran into a scenario to conduct two sample tests between all
paired combination of columns of an RDD. But the networking load and
generation of pair-wise computation is too time consuming. That has puzzled
me for a long time. I want to conduct Wilcoxon rank-sum test
(http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) here, and get the
top k most similar pairs. 

To be more concrete, I want to:
input: original = RDD[Array[Double](3000)]
output: a matrix M of the size 3000x3000, where M{i}{j} equals to the result
of a certain statistical test between RDD columns, that is,
original.map(_(i)) and original.map(_(j))

I've read the source code of Pearson and Spearman's correlation in MLlib
Statistics, as well as the implementation of the DIMSUM algorithm in
RowMatrix.scala, cuz they all conduct pair-wise computation between columns 
in a paralleled way. However, it seems that the reason why those tests are
applicable in Spark is because they only exploit column-summary info (sum of
all elements in RDD[Double[) and information in the same array, to be
explicit, they are all similar to the following:
input: original = RDD[Array[Double](3000)]
step1: summary = original.aggregate
step2: summary_br = sc.broadcast(summary)
step3: result =  original.map{i => val summary_v = summary_br.value; some
computation on i}.aggregate
output: result: a matrix of 3000x3000

They do not require info exchange between different records in RDD. However,
wilcoxon test requires co-ranking between pairs. It seems I have to generate
pair-wise computations one by one on RDD columns. This will conduct at least
(n^2-n)/2 jobs, which is nearly 5000000 when n=3000. It is not acceptable. 

Does anyone have better ideas? This is really torturing me cuz I have a
related project on hand!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-do-pair-wise-computation-between-RDD-feature-columns-tp22920.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message