spark-user mailing list archives

From Laeeq Ahmed <laeeqsp...@yahoo.com.INVALID>
Subject Processing multiple columns in parallel
Date Mon, 18 May 2015 13:37:56 GMT
Hi,
Suppose I have a tab-delimited text file with 10 columns, where each column contains a set of text.
I would like to do a word count for each column. In Scala, I would use the following RDD transformations
and action:

val data = sc.textFile("hdfs://namenode/data.txt")
for (i <- 0 until 10) {
  data.map(_.split("\t", -1)(i))   // extract column i
      .map((_, 1))
      .reduceByKey(_ + _)          // count occurrences per word in this column
      .saveAsTextFile(s"column_$i")
}

Within the for loop, each column's word count is itself computed in parallel, but the columns are
processed sequentially, one after another from 0 to 9.

Is there any way to process multiple columns in parallel in Spark? I saw a posting
about using Akka, but Spark itself already uses Akka internally. Any pointers would be appreciated.
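
One pattern worth considering (my own sketch, not something from this thread): Spark's scheduler accepts jobs submitted from multiple driver threads, so wrapping each per-column action in a `Future` launches all ten jobs concurrently instead of sequentially. The snippet below shows only the driver-side concurrency pattern; the per-column Spark action is replaced by a placeholder computation so the example is self-contained, and the output path `column_$i` used above is likewise illustrative.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Launch one concurrent task per column. In the real program the Future
// body would be the Spark action, e.g.
//   data.map(_.split("\t", -1)(i)).map((_, 1)).reduceByKey(_ + _)
//       .saveAsTextFile(s"column_$i")
// Each action blocks its own thread, so the ten jobs run concurrently.
val columnJobs = (0 until 10).map { i =>
  Future {
    i * i // placeholder standing in for the per-column Spark action
  }
}

// Block the driver until every concurrent job has finished.
val results = Await.result(Future.sequence(columnJobs), Duration.Inf)
```

How much actually runs in parallel then depends on the cluster's resources and the configured scheduling mode.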


Regards,
Laeeq