spark-dev mailing list archives

From Debasish Das <debasish.da...@gmail.com>
Subject Re: How can I do pair-wise computation between RDD feature columns?
Date Sun, 17 May 2015 01:43:29 GMT
I opened it just today, but it should help you:

https://github.com/apache/spark/pull/6213

On Sat, May 16, 2015 at 6:18 PM, Chunnan Yao <yaochunnan@gmail.com> wrote:

> Hi all,
> Recently I've run into a scenario where I need to conduct two-sample tests
> between all paired combinations of columns of an RDD, but the network load
> and the generation of the pair-wise computations are too time-consuming.
> That has puzzled me for a long time. I want to conduct the Wilcoxon
> rank-sum test
> (http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) and get the
> top k most similar pairs.
>
> To be more concrete, I want:
> input: original = RDD[Array[Double](3000)]
> output: a matrix M of size 3000x3000, where M(i)(j) equals the result of a
> certain statistical test between RDD columns i and j, that is, between
> original.map(_(i)) and original.map(_(j))
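Since the (n^2-n)/2 job count later in the question implies the result for (i, j) determines (j, i), only the upper-triangle pairs of that matrix need computing. A quick check of that count in plain Scala (illustrative, not part of any Spark API):

```scala
// number of distinct unordered column pairs (i, j) with i < j, for n columns
val n: Long = 3000
val numPairs = n * (n - 1) / 2
println(numPairs) // 4498500
```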
>
> I've read the source code of the Pearson and Spearman correlations in
> MLlib Statistics, as well as the implementation of the DIMSUM algorithm in
> RowMatrix.scala, because they all conduct pair-wise computation between
> columns in a parallel way. However, it seems those tests are applicable in
> Spark only because they exploit column-summary information (the sum of all
> elements in an RDD[Double]) together with information within the same
> array. To be explicit, they all follow a pattern similar to:
> input: original = RDD[Array[Double](3000)]
> step 1: summary = original.aggregate(...)
> step 2: summary_br = sc.broadcast(summary)
> step 3: result = original.map { i => val summary_v = summary_br.value;
> some computation on i }.aggregate(...)
> output: result, a matrix of 3000x3000
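A minimal local sketch of the summary-broadcast pattern in those steps, with plain Scala collections standing in for the RDD; the "some computation on i" here is just centering each column by its mean, and all names are illustrative:

```scala
// a tiny dataset in place of RDD[Array[Double]]
val original: Seq[Array[Double]] =
  Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
val nRows = original.length
val nCols = original.head.length

// step 1: one aggregation pass produces the per-column summary (sums here)
val summary: Array[Double] =
  original.foldLeft(new Array[Double](nCols)) { (acc, row) =>
    for (j <- 0 until nCols) acc(j) += row(j)
    acc
  }

// step 2: in Spark this would be sc.broadcast(summary); locally the map
// closure simply captures the value
// step 3: each record combines its own values with the shared summary
val centered: Seq[Array[Double]] =
  original.map(row => row.indices.map(j => row(j) - summary(j) / nRows).toArray)
```

The key property is that each record only ever touches its own array plus the small shared summary, so no record-to-record communication is needed.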
>
> They do not require information exchange between different records of the
> RDD. However, the Wilcoxon test requires co-ranking the pooled values of
> each pair of columns, so it seems I have to generate the pair-wise
> computations one by one on the RDD columns. That takes at least (n^2-n)/2
> jobs, which is roughly 4.5 million when n = 3000. It is not acceptable.
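For reference, the per-pair kernel being discussed can be sketched in plain Scala as below; `mannWhitneyU` is an illustrative name, not an MLlib API. The point is that it needs both columns' complete data pooled and sorted together, which is exactly the information a per-column summary cannot provide:

```scala
// Mann-Whitney / Wilcoxon rank-sum U statistic for one pair of columns;
// tied values receive the average of their ranks
def mannWhitneyU(x: Seq[Double], y: Seq[Double]): Double = {
  // tag each value with its sample (0 = x, 1 = y) and sort the pooled data
  val pooled = (x.map((_, 0)) ++ y.map((_, 1))).sortBy(_._1)
  val ranks = new Array[Double](pooled.length)
  var i = 0
  while (i < pooled.length) {
    var j = i
    while (j + 1 < pooled.length && pooled(j + 1)._1 == pooled(i)._1) j += 1
    val avg = (i + j + 2) / 2.0 // average of the 1-based ranks i+1 .. j+1
    for (k <- i to j) ranks(k) = avg
    i = j + 1
  }
  // R1 = rank sum of sample x; U1 = R1 - n1(n1+1)/2
  val r1 = pooled.zipWithIndex.collect { case ((_, 0), k) => ranks(k) }.sum
  r1 - x.length * (x.length + 1) / 2.0
}
```

U runs from 0 (every x below every y) to n1*n2 (every x above every y), so completely separated samples hit the extremes.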
>
> Does anyone have better ideas? This is really torturing me because I have
> a related project on hand!
>
>
>
> -----
> Feel the sparking Spark!
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-can-I-do-pair-wise-computation-between-RDD-feature-columns-tp12287.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
