spark-user mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: Union of 2 RDD's only returns the first one
Date Thu, 23 Jan 2014 00:46:49 GMT
Ah, somehow after all this time I've never seen that!

On Wed, Jan 22, 2014 at 4:45 PM, Aureliano Buendia <buendia360@gmail.com> wrote:
>
>
>
> On Thu, Jan 23, 2014 at 12:37 AM, Patrick Wendell <pwendell@gmail.com>
> wrote:
>>
>> What is the ++ operator here? Is this something you defined?
>
>
> No, it's an alias for union defined in RDD.scala:
>
> def ++(other: RDD[T]): RDD[T] = this.union(other)
>
>>
>>
>> Another issue is that RDDs are not ordered, so the union of two RDDs
>> has no well-defined ordering.
>>
>> If you do want to do this, you could coalesce into one partition, then
>> call mapPartitions and return an iterator that first emits your header
>> and then the rest of the file, then call saveAsTextFile. Keep in mind
>> this will only work if you coalesce into a single partition.
>
>
> Thanks! I'll give this a try.
>
>>
>>
>> myRdd.coalesce(1)
>> .map(_.mkString(","))
>> .mapPartitions(it => (Seq("col1,col2,col3") ++ it).iterator)
>> .saveAsTextFile("out.csv")
>>
>> - Patrick
>>
>> On Wed, Jan 22, 2014 at 11:12 AM, Aureliano Buendia
>> <buendia360@gmail.com> wrote:
>> > Hi,
>> >
>> > I'm trying to find a way to create a csv header when using
>> > saveAsTextFile,
>> > and I came up with this:
>> >
>> > (sc.makeRDD(Array("col1,col2,col3"), 1) ++
>> > myRdd.coalesce(1).map(_.mkString(",")))
>> >       .saveAsTextFile("out.csv")
>> >
>> > But it only saves the header part. Why does the union method not
>> > return both RDDs?
>
>
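
The header-first iterator suggested in the thread can be tried without a cluster. The following is a minimal sketch in plain Scala, not code from the thread: the names HeaderDemo and withHeader are hypothetical, and an ordinary Iterator stands in for the single partition left after coalesce(1). It uses Iterator(header) ++ records rather than Seq(header) ++ records, on the assumption that keeping the concatenation lazy avoids building the whole partition in memory first.

```scala
object HeaderDemo {
  // Hypothetical helper: lazily emit the header, then the partition's records.
  def withHeader(header: String)(records: Iterator[String]): Iterator[String] =
    Iterator(header) ++ records

  def main(args: Array[String]): Unit = {
    // Stand-in for one partition of rows, formatted the same way as in the thread.
    val partition: Iterator[String] =
      Iterator(Seq(1, 2, 3), Seq(4, 5, 6)).map(_.mkString(","))

    // Prints the header line first, then "1,2,3" and "4,5,6".
    withHeader("col1,col2,col3")(partition).foreach(println)
  }
}
```

With Spark, the same function could presumably be plugged into the pipeline from the thread as myRdd.coalesce(1).map(_.mkString(",")).mapPartitions(it => HeaderDemo.withHeader("col1,col2,col3")(it)).saveAsTextFile("out.csv"). As noted above, the header lands only at the top of the output if there is exactly one partition.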
