spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: Processing multiple columns in parallel
Date Mon, 18 May 2015 14:46:16 GMT
My first thought would be creating 10 RDDs and running your word count on
each of them. I think the Spark scheduler will run the 10 jobs in parallel
as long as they are submitted concurrently.
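
Something like the sketch below (untested, and the output paths column-0
through column-9 are just placeholders) would do that: Spark accepts job
submissions from multiple threads, and a Scala parallel collection supplies
the threads.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ParallelColumnCount"))

    // Cache the input so the ten jobs reuse it instead of re-reading HDFS.
    val data = sc.textFile("hdfs://namenode/data.txt").cache()

    // .par runs the loop body on a thread pool; each saveAsTextFile is an
    // action, so the ten jobs are submitted (and scheduled) concurrently.
    (0 until 10).par.foreach { i =>
      data.map(_.split("\t", -1)(i))
          .map((_, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile(s"column-$i")
    }

Whether the jobs truly overlap depends on free executor cores; with the
default FIFO scheduler, later jobs use whatever capacity the earlier ones
leave idle.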

Best
Ayan
On 18 May 2015 23:41, "Laeeq Ahmed" <laeeqspark@yahoo.com.invalid> wrote:

> Hi,
>
> Consider I have a tab-delimited text file with 10 columns. Each column is
> a set of text. I would like to do a word count for each column. In Scala,
> I would do the following RDD transformation and action:
>
>     val data = sc.textFile("hdfs://namenode/data.txt")
>     for (i <- 0 until 10) {
>       data.map(_.split("\t", -1)(i)).map((_, 1)).reduceByKey(_ + _).saveAsTextFile(i.toString)
>     }
>
> Within the for loop each word count is itself a parallel job, but the
> columns are processed sequentially from 0 to 9.
>
> Is there any way I can process multiple columns in parallel in Spark? I
> saw a posting about using Akka, but RDDs themselves already use Akka. Any
> pointers would be appreciated.
>
>
> Regards,
> Laeeq
>
