By repartition I think you mean coalesce() where you would get one parquet file per partition? 

And this would be a new immutable copy so that you would want to write this new RDD to a different HDFS directory? 


I'd say you have to apply repartition just once, to the RDD that is the union of all your files.
And it has to be done right before you do anything else.
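
For what it's worth, here is a rough PySpark sketch of "union everything, then repartition once". The function names, the `paths` list and the 128 MB target partition size are my own assumptions for illustration, not something from this thread, and `sc` is assumed to be an existing SparkContext.

```python
import math

def target_partitions(total_bytes, partition_bytes=128 * 1024 * 1024):
    """Partitions needed so each holds roughly partition_bytes (assumed 128 MB)."""
    return max(1, math.ceil(total_bytes / partition_bytes))

def union_then_repartition(sc, paths, num_partitions):
    """Union all inputs first, then repartition the result exactly once."""
    # One RDD per input file; each brings its own partitions.
    rdds = [sc.textFile(p) for p in paths]
    # The union's partition count is the sum of the inputs' partition counts.
    combined = sc.union(rdds)
    # A single shuffle here, instead of one after every pairwise union.
    return combined.repartition(num_partitions)
```

As a usage example, for about 10 GB of input `target_partitions(10 * 1024**3)` gives 80, which could then be passed as `num_partitions`.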

If there is something in your files that you don't need, then the sooner you project it away, the better.
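
A minimal sketch of what projecting early could look like, assuming tab-delimited lines and that only columns 0 and 2 are needed. Both are hypothetical choices of mine, as are the function names. The idea is just that dropping fields before the shuffle means the repartition moves fewer bytes.

```python
def project(line, keep=(0, 2), sep="\t"):
    """Keep only the wanted columns of one delimited text line."""
    fields = line.split(sep)
    return tuple(fields[i] for i in keep)

def project_then_repartition(rdd, num_partitions, keep=(0, 2)):
    """Narrow each row first, then shuffle the smaller rows once."""
    # rdd is assumed to be an RDD of text lines, e.g. from sc.textFile.
    return rdd.map(lambda line: project(line, keep)).repartition(num_partitions)
```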

I am trying to combine several small text files (each roughly hundreds of MB to 2-3 GB) into one big parquet file.

I am loading each one of them and taking a union, but this leads to an enormous number of partitions, as union keeps adding the partitions of the input RDDs together.

I also tried loading all the files via a wildcard, but that behaves almost the same as union, i.e. it generates a lot of partitions.

One approach I thought of was to repartition the RDD generated after each union and then continue the process, but I don't know how efficient that is.

Has anyone come across this kind of thing before?

- Apurva