What you should be doing is a join, something like this:

//Create a (key, value) pair RDD, the key being column1 of file1
val rdd1 = sc.textFile(file1).map(x => (x.split(",")(0), x.split(",")))

//Create a (key, value) pair RDD, the key being column2 of file2
val rdd2 = sc.textFile(file2).map(x => (x.split(",")(1), x.split(",")))

//Now join the datasets
val joined = rdd1.join(rdd2)

//Now do the replacement
val replaced = joined.map(...)
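
A fuller sketch of the join and replacement steps above, for illustration only. The helper name `replaceColumn1` is hypothetical, and the sketch assumes plain comma-separated lines with no quoted fields and no header row. It keys file1 on column1 and file2 on column2, then overwrites column1 of each matching file1 row with column1 of file2:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): takes the raw lines of file1 and
// file2 and returns the matching file1 rows with column1 replaced.
// Assumes simple comma-separated lines, no quoting, no header row.
def replaceColumn1(file1Lines: RDD[String], file2Lines: RDD[String]): RDD[String] = {
  // file1: key on column1 (index 0), keep the whole row as the value
  val rdd1 = file1Lines.map { line =>
    val cols = line.split(",").map(_.trim)
    (cols(0), cols)
  }

  // file2: key on column2 (index 1), keep only column1 as the value
  val rdd2 = file2Lines.map { line =>
    val cols = line.split(",").map(_.trim)
    (cols(1), cols(0))
  }

  // Inner join on the key, then overwrite column1 of the file1 row
  rdd1.join(rdd2).map { case (_, (row1, newColumn1)) =>
    row1.updated(0, newColumn1).mkString(",")
  }
}
```

It could be called as `replaceColumn1(sc.textFile(file1), sc.textFile(file2))`. Note that `join` is an inner join, so file1 rows without a match are dropped. If unmatched rows should be kept unchanged, `leftOuterJoin` is one option; it returns the file2 side as an `Option`.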

On Mon, Mar 21, 2016 at 10:31 AM, Shishir Anshuman <shishiranshuman@gmail.com> wrote:
I have stored the contents of two csv files in separate RDDs. 

file1.csv format: (column1,column2,column3)
file2.csv format: (column1, column2)

column1 of file1 and column2 of file2 contain similar data. I want to compare the two columns and, if a match is found:
  • Replace the data at column1 (file1) with column1 (file2)

For this reason, I am not using a normal RDD.

I am still new to Apache Spark, so any suggestions will be greatly appreciated.

On Mon, Mar 21, 2016 at 10:09 AM, Prem Sure <premsure542@gmail.com> wrote:
Any specific reason you would like to use collectAsMap only? You could probably move to a normal RDD instead of a pair RDD.

On Monday, March 21, 2016, Mark Hamstra <mark@clearstorydata.com> wrote:
You're not getting what Ted is telling you.  Your `dict` is an RDD[String]  -- i.e. it is a collection of a single value type, String.  But `collectAsMap` is only defined for PairRDDs that have key-value pairs for their data elements.  Both a key and a value are needed to collect into a Map[K, V]. 
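
To illustrate the point, here is a minimal sketch of turning the lines into key-value pairs before calling `collectAsMap`. The helper name `broadcastDict` is hypothetical, and the sketch assumes each line of the CSV file has at least two comma-separated columns, with the first used as the key and the second as the value:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical helper (not part of Spark). Assumes each line looks like
// "key,value"; adjust the column indices to match the real file.
def broadcastDict(sc: SparkContext, path: String): Broadcast[scala.collection.Map[String, String]] = {
  // Map each line to a (key, value) tuple so the RDD becomes an RDD[(String, String)]
  val pairs = sc.textFile(path).map { line =>
    val cols = line.split(",")
    (cols(0), cols(1))
  }

  // collectAsMap is available now that the elements are pairs;
  // it brings the whole map to the driver, so the data must fit in memory
  sc.broadcast(pairs.collectAsMap())
}
```

On the executors, the map would then be read through the broadcast's `value` field, for example `dictBroadcast.value.get(someKey)`.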

On Sun, Mar 20, 2016 at 8:19 PM, Shishir Anshuman <shishiranshuman@gmail.com> wrote:
Yes, I have included that class in my code.
I guess it's something to do with the RDD format. I am not able to figure out the exact reason.

On Fri, Mar 18, 2016 at 9:27 AM, Ted Yu <yuzhihong@gmail.com> wrote:
It is defined in:

On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman <shishiranshuman@gmail.com> wrote:
I am using the following code snippet in Scala:

val dict: RDD[String] = sc.textFile("path/to/csv/file")
val dict_broadcast=sc.broadcast(dict.collectAsMap())

On compiling, it generates this error:

scala:42: value collectAsMap is not a member of org.apache.spark.rdd.RDD[String]

val dict_broadcast=sc.broadcast(dict.collectAsMap())