// Create a key-value pair, the key being column1
val rdd1 = sc.textFile(file1).map(x => (x.split(",")(0), x.split(",")))

// Create a key-value pair, the key being column2
val rdd2 = sc.textFile(file2).map(x => (x.split(",")(1), x.split(",")))

// Now join the datasets
val joined = rdd1.join(rdd2)

// Now do the replacement
val replaced = joined.map(...)
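To see what that pipeline computes without a Spark cluster, here is a minimal plain-Python sketch of the same key-by-column, join, and replace steps, using lists of rows in place of RDDs (the sample row values are made up for illustration):

```python
# Plain-Python simulation of the Spark join-and-replace pipeline above.
# file1 rows: (column1, column2, column3); file2 rows: (column1, column2).
file1_rows = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]
file2_rows = [["x1", "a1"], ["x2", "a2"]]

# Key file1 by column1 (index 0) and file2 by column2 (index 1),
# as the map(...) calls above do.
rdd1 = {row[0]: row for row in file1_rows}
rdd2 = {row[1]: row for row in file2_rows}

# "Join" on the shared key, then replace column1 of the file1 row
# with column1 of the matching file2 row.
replaced = [
    [cols2[0]] + cols1[1:]
    for key, cols1 in rdd1.items()
    if (cols2 := rdd2.get(key)) is not None
]
# replaced == [["x1", "b1", "c1"], ["x2", "b2", "c2"]]
```

In real Spark, `rdd1.join(rdd2)` yields pairs of `(key, (valueFromRdd1, valueFromRdd2))`, so the final `map` would pattern-match on that tuple to build the replaced row.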
I have stored the contents of two CSV files in separate RDDs.

file1.csv format: (column1, column2, column3)
file2.csv format: (column1, column2)

column1 of file1 and column2 of file2 contain similar data. I want to compare the two columns and, if a match is found:
- Replace the data at column1 (file1) with column1 (file2)

For this reason, I am not using a normal RDD. I am still new to Apache Spark, so any suggestion will be greatly appreciated.

On Mon, Mar 21, 2016 at 10:09 AM, Prem Sure <firstname.lastname@example.org> wrote:

Any specific reason you would like to use collectAsMap only? You could probably move to a normal RDD instead of a Pair.
On Monday, March 21, 2016, Mark Hamstra <email@example.com> wrote:

You're not getting what Ted is telling you. Your `dict` is an RDD[String] -- i.e. it is a collection of a single value type, String. But `collectAsMap` is only defined for PairRDDs that have key-value pairs for their data elements. Both a key and a value are needed to collect into a Map[K, V].

On Sun, Mar 20, 2016 at 8:19 PM, Shishir Anshuman <firstname.lastname@example.org> wrote:

Yes, I have included that class in my code. I guess it's something to do with the RDD format; I am not able to figure out the exact reason.

On Fri, Mar 18, 2016 at 9:27 AM, Ted Yu <email@example.com> wrote:

It is defined in:
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman <firstname.lastname@example.org> wrote:

I am using the following code snippet in Scala:

val dict: RDD[String] = sc.textFile("path/to/csv/file")
val dict_broadcast = sc.broadcast(dict.collectAsMap())

On compiling, it generates this error:

scala:42: value collectAsMap is not a member of org.apache.spark.rdd.RDD[String]
val dict_broadcast = sc.broadcast(dict.collectAsMap())
                                       ^
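Mark's point -- that a Map needs both a key and a value per element, which an RDD[String] cannot provide -- can be simulated with plain Python in place of Spark (the sample lines and the choice of which column becomes the key are illustrative assumptions):

```python
# Simulating why collectAsMap fails on an RDD[String] but works on a PairRDD.
lines = ["a,1", "b,2", "c,3"]   # what an RDD[String] holds: one value per element

# Building a dict straight from bare strings fails, because a string is a
# single value with no (key, value) structure:
#   dict(lines)  ->  ValueError

# Mapping each line to a pair first (the PairRDD step) makes it work:
pairs = [tuple(line.split(",")) for line in lines]
lookup = dict(pairs)            # analogous to pairRdd.collectAsMap()
# lookup == {"a": "1", "b": "2", "c": "3"}
```

The Spark fix follows the same shape: map the RDD[String] into an RDD of 2-tuples first, and `collectAsMap` (from PairRDDFunctions) then becomes available.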