spark-user mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: Union of 2 RDD's only returns the first one
Date Thu, 23 Jan 2014 00:37:30 GMT
What is the ++ operator here? Is this something you defined?

Another issue is that RDDs are not ordered, so the union of two RDDs
doesn't have a well-defined ordering.

If you do want to do this, you could coalesce into one partition, then
call mapPartitions and return an iterator that first emits your header
and then the rest of the file, then call saveAsTextFile. Keep in mind
this will only work if you coalesce into a single partition.

myRdd.coalesce(1)
  .map(_.mkString(","))
  .mapPartitions(it => Iterator("col1,col2,col3") ++ it)
  .saveAsTextFile("out.csv")
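
The header-prepending step can be tried out without a cluster, since the
function passed to mapPartitions only sees a plain Scala Iterator. Below is
a minimal sketch under that assumption: withHeader is a hypothetical helper
(not part of Spark), and rows is made-up sample data standing in for the
contents of the single coalesced partition. Building the result as
Iterator(header) ++ rows keeps it lazy, so the partition is not copied into
an in-memory collection first.

```scala
// Hypothetical helper, not a Spark API: emits the header line first,
// then every row of the partition, without materializing the rows.
def withHeader(header: String, rows: Iterator[String]): Iterator[String] =
  Iterator(header) ++ rows

// Made-up sample data standing in for the one coalesced partition.
val rows: Iterator[String] =
  Iterator(Seq(1, 2, 3), Seq(4, 5, 6)).map(_.mkString(","))

// Collect to a List only to inspect the result; Spark would consume
// the iterator directly.
val lines: List[String] = withHeader("col1,col2,col3", rows).toList
```

With more than one partition the same function would run once per
partition, which is why the approach depends on coalesce(1).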

- Patrick

On Wed, Jan 22, 2014 at 11:12 AM, Aureliano Buendia
<buendia360@gmail.com> wrote:
> Hi,
>
> I'm trying to find a way to create a csv header when using saveAsTextFile,
> and I came up with this:
>
> (sc.makeRDD(Array("col1,col2,col3"), 1) ++
> myRdd.coalesce(1).map(_.mkString(",")))
>       .saveAsTextFile("out.csv")
>
> But it only saves the header part. Why is that the union method does not
> return both RDD's?
