spark-user mailing list archives

From Amit Kumar <kumarami...@gmail.com>
Subject RDD with a Map
Date Tue, 03 Jun 2014 21:56:29 GMT
Hi Folks,

I am new to Spark, and this is probably a basic question.

I have a file on HDFS:

1, one
1, uno
2, two
2, dos

I want to create a multimap-style RDD, RDD[Map[String,List[String]]], like this:

{"1"->["one","uno"], "2"->["two","dos"]}


First I read the file:

val identityData: RDD[String] = sc.textFile($path_to_the_file, 2).cache()

val identityDataList: RDD[List[String]] =
  identityData.map { line =>
    // split on the comma and trim the stray space that follows it
    line.split(",").map(_.trim).toList
  }

Then I group them by the first element:

val grouped: RDD[(String, Iterable[List[String]])] =
  identityDataList.groupBy { element =>
    element(0)
  }

Then I do the equivalent of mapValues from the Scala collections to drop
the grouping key from each list:

val groupedWithValues: RDD[(String, List[String])] =
  grouped.flatMap { case (key, lists) =>
    // keep only the second column of each grouped row
    List((key, lists.map(element => element(1)).toList))
  }
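
As an aside, it looks like pair RDDs expose mapValues directly (through
PairRDDFunctions), so maybe this whole step collapses to a one-liner; a
sketch, assuming the implicit conversion applies to my
RDD[(String, Iterable[List[String]])]:

val groupedWithValues: RDD[(String, List[String])] =
  grouped.mapValues(lists => lists.map(_(1)).toList)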

To actually materialize this, I call collect:

val groupedAndCollected = groupedWithValues.collect()

I get an Array[(String, List[String])].

I am trying to figure out whether there is a way for me to get a
Map[String,List[String]] (a multimap) on the driver, or to create an
RDD[Map[String,List[String]]].
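
From the PairRDDFunctions scaladoc it looks like collectAsMap might hand
me that map on the driver directly, or I could just call toMap on the
array I already have; a sketch of both (untested, and I may be misreading
the API):

val asMap: Map[String, List[String]] =
  groupedWithValues.collectAsMap().toMap  // collectAsMap returns a scala.collection.Map, hence the toMap
// or, from the array I already collected:
val asMap2: Map[String, List[String]] = groupedAndCollected.toMap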


I am sure there is something simpler; I would appreciate any advice.

Many thanks,
Amit
