spark-user mailing list archives

From lonikar <loni...@gmail.com>
Subject Re: Column operation on Spark RDDs.
Date Mon, 08 Jun 2015 08:45:45 GMT
Two simple suggestions:
1. There is no need to call zipWithIndex twice. Reuse the earlier indexed RDD, dt.
2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark
job. Note that the resulting IDs are unique but not necessarily consecutive.

Below is your code with the above changes:

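var dataRDD = sc.textFile("/test.csv").map(_.split(","))
// Key each row by a unique ID: (row, id) is swapped to (id, row)
val dt = dataRDD.zipWithUniqueId().map(_.swap)
// x(1) + x(18) concatenates strings, since split yields Array[String]
val newCol1 = dt.map { case (i, x) => (i, x(1) + x(18)) }
val newCol2 = newCol1.join(dt).map(x => function(.........))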
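
To illustrate why zipWithUniqueId can skip the extra job, here is a rough plain-Scala sketch (no Spark required) that simulates the two ID schemes as I understand them from the RDD API docs. The object ZipIdSketch, both helper functions, and the nested Seq standing in for RDD partitions are all hypothetical names for illustration; this is not Spark's actual implementation.

```scala
// Plain-Scala simulation of the two ID schemes; not Spark's real code.
// `partitions` is a hypothetical stand-in for the partitions of an RDD.
object ZipIdSketch {
  // zipWithUniqueId scheme: item j of partition k gets id j * n + k,
  // where n is the number of partitions. Each partition can compute its
  // ids from local information only, so no extra pass is needed.
  def uniqueIds[A](partitions: Seq[Seq[A]]): Seq[(A, Long)] = {
    val n = partitions.length.toLong
    partitions.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, j) => (item, j * n + k) }
    }
  }

  // zipWithIndex scheme: consecutive ids. Each partition needs the sizes
  // of all preceding partitions first; on a real multi-partition RDD that
  // counting step is what costs the extra job.
  def consecutiveIds[A](partitions: Seq[Seq[A]]): Seq[(A, Long)] = {
    val offsets = partitions.map(_.length.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).flatMap { case (part, start) =>
      part.zipWithIndex.map { case (item, j) => (item, start + j) }
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("a", "b", "c"), Seq("d"), Seq("e", "f"))
    println(uniqueIds(parts))      // ids have gaps when partitions are uneven
    println(consecutiveIds(parts)) // ids run 0, 1, 2, ... without gaps
  }
}
```

Since the IDs here only serve as join keys, the gaps left by the unique-ID scheme should not matter. If you need row numbers that are consecutive, zipWithIndex is still the one to use.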

Hope this helps.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165p23203.html
