spark-user mailing list archives

From Michael Segel <>
Subject Re: Union of multiple RDDs
Date Wed, 22 Jun 2016 00:04:19 GMT
By repartition I think you mean coalesce() where you would get one parquet file per partition?

And this would be a new immutable copy so that you would want to write this new RDD to a different
HDFS directory? 


> On Jun 21, 2016, at 8:06 AM, Eugene Morozov <> wrote:
> Apurva, 
> I'd say you have to apply repartition just once, to the RDD that is the union of all your RDDs.
> And it has to be done right before you do anything else.
> If something is not needed on your files, then the sooner you project, the better.
> Hope, this helps.
> --
> Be well!
> Jean Morozov
> On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan <> wrote:
> Hello,
> I am trying to combine several small text files (each file is approx. hundreds of MBs to 2-3 gigs) into one big parquet file.
> I am loading each one of them and trying to take a union, however this leads to an enormous number of partitions, as union keeps adding the partitions of the input RDDs together.
> I also tried loading all the files via wildcard, but that behaves almost the same as union, i.e. it generates a lot of partitions.
> One approach I thought of was to repartition the RDD generated after each union and then continue the process, but I don't know how efficient that is.
> Has anyone come across this kind of thing before?
> - Apurva 
