spark-user mailing list archives

From Sanjay Subramanian <sanjaysubraman...@yahoo.com.INVALID>
Subject Re: FlatMapValues
Date Mon, 05 Jan 2015 18:30:52 GMT
Cool, let me adapt that. Thanks a ton.

Regards,
Sanjay

From: Sean Owen <sowen@cloudera.com>
To: Sanjay Subramanian <sanjaysubramanian@yahoo.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Sent: Monday, January 5, 2015 3:19 AM
Subject: Re: FlatMapValues

For the record, the solution I was suggesting was roughly this:

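inputRDD.flatMap { input =>
  val tokens = input.split(',')
  // The first token is the record id.
  val id = tokens(0)
  // Group the remaining tokens in twos and keep the first of each group.
  val keyValuePairs = tokens.tail.grouped(2)
  val keys = keyValuePairs.map(_(0))
  // Emit one (id, key) tuple per kept token.
  keys.map(key => (id, key))
}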

This is much more efficient.
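
As a rough sketch of the same idea applied to the simpler color example from the original question (where every token after the id becomes a value), the following uses a plain Scala List in place of an RDD so that it runs without Spark. The names FlatMapSketch, pairsFor and expand are made up for illustration and do not appear in the thread.

```scala
object FlatMapSketch {
  // Turns "1,red,blue,green" into (1,red), (1,blue), (1,green).
  def pairsFor(line: String): List[(String, String)] = {
    val tokens = line.split(',')
    val id = tokens(0)
    tokens.tail.toList.map(value => (id, value))
  }

  // A plain List stands in for the RDD here (an assumption of this sketch);
  // the same per-line function could be handed to flatMap on an RDD[String].
  def expand(lines: List[String]): List[String] =
    lines.flatMap(pairsFor).map { case (id, value) => s"$id,$value" }
}
```

Calling FlatMapSketch.expand(List("1,red,blue,green", "2,yellow,violet,pink")) returns the six lines 1,red through 2,pink in the order the question asks for.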



On Wed, Dec 31, 2014 at 3:46 PM, Sean Owen <sowen@cloudera.com> wrote:
> From the clarification below, the problem is that you are calling
> flatMapValues, which is only available on an RDD of key-value tuples.
> Your map function returns a tuple in one case but a String in the
> other, so your RDD is a bunch of Any, which is not at all what you
> want. You need to return a tuple in both cases, which is what Kapil
> pointed out.
>
> However it's still not quite what you want. Your input is basically
> [key value1 value2 value3] so you want to flatMap that to (key,value1)
> (key,value2) (key,value3). flatMapValues does not come into play.
>
> On Wed, Dec 31, 2014 at 3:25 PM, Sanjay Subramanian
> <sanjaysubramanian@yahoo.com> wrote:
>> My understanding is as follows
>>
>> STEP 1 (This would create a pair RDD)
>> =======
>>
>> reacRdd.map(line => line.split(',')).map(fields => {
>>  if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
>>
>> (fields(0),(fields(1)+"\t"+fields(3)+"\t"+fields(5)+"\t"+fields(7)+"\t"+fields(9)))
>>  }
>>  else {
>>    ""
>>  }
>>  })
>>
>> STEP 2
>> =======
>> Since previous step created a pair RDD, I thought flatMapValues method will
>> be applicable.
>> But the code does not even compile saying that flatMapValues is not
>> applicable to RDD :-(
>>
>>
>> reacRdd.map(line => line.split(',')).map(fields => {
>>  if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
>>
>> (fields(0),(fields(1)+"\t"+fields(3)+"\t"+fields(5)+"\t"+fields(7)+"\t"+fields(9)))
>>  }
>>  else {
>>    ""
>>  }
>>  }).flatMapValues(skus =>
>> skus.split('\t')).saveAsTextFile("/data/vaers/msfx/reac/" + outFile)
>>
>>
>> SUMMARY
>> =======
>> when a dataset looks like the following
>>
>> 1,red,blue,green
>> 2,yellow,violet,pink
>>
>> I want to output the following and I am asking how do I do that ? Perhaps my
>> code is 100% wrong. Please correct me and educate me :-)
>>
>> 1,red
>> 1,blue
>> 1,green
>> 2,yellow
>> 2,violet
>> 2,pink
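
The type problem described in the quoted messages comes from the if/else returning a tuple in one branch and a String in the other, so the result is no longer a collection of key-value tuples and pair-only methods such as flatMapValues are unavailable. One possible way around it, sketched here under the assumption that short rows and the header row should simply be dropped, is to return a list of tuples from both branches and let flatMap flatten it. ReacSketch and pairsFor are hypothetical names, and a plain String is used instead of an RDD element so the sketch runs without Spark.

```scala
object ReacSketch {
  // Emits (id, value) for fields 1, 3, 5, 7 and 9 of a valid row,
  // and nothing at all for short rows or the header row.
  // Both branches have the same type, so the result stays a list of tuples.
  def pairsFor(line: String): List[(String, String)] = {
    val fields = line.split(',')
    if (fields.length >= 11 && !fields(0).contains("VAERS_ID"))
      List(1, 3, 5, 7, 9).map(i => (fields(0), fields(i)))
    else
      Nil
  }
}
```

Applied per line through flatMap, this yields one (id, value) pair per selected field and skips rows that fail the check, so no second flatMapValues step is needed.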