spark-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Re: Union of multiple RDDs
Date Wed, 22 Jun 2016 00:04:19 GMT
By repartition, I think you mean coalesce(), where you would get one Parquet file per partition?


And since this would be a new immutable copy, you would want to write the new RDD to a different HDFS directory?

-Mike
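
The advice quoted below (repartition once, on the union of all files, rather than after every union) can be illustrated with a small model. This is a sketch in plain Python, not the Spark API: an "RDD" is modeled as a list of partitions, `repartition` is assumed to move every record (a full shuffle), and `coalesce` is assumed to merge whole partitions without moving individual records. The file count, partition counts, and record counts are made up for illustration.

```python
# Toy model in plain Python, NOT the Spark API.
# An "RDD" is a list of partitions; each partition is a list of records.

def union(a, b):
    """Model of union: every input partition is kept, so the counts add up."""
    return a + b

def repartition(rdd, n):
    """Model of a full shuffle: all records are redistributed over n partitions.

    Returns the new partitions and the number of records that were moved.
    """
    records = [r for part in rdd for r in part]
    return [records[i::n] for i in range(n)], len(records)

def coalesce(rdd, n):
    """Model of coalesce without a shuffle: whole partitions are merged."""
    n = min(n, len(rdd))  # this model never increases the partition count
    groups = [[] for _ in range(n)]
    for i, part in enumerate(rdd):
        groups[i * n // len(rdd)].extend(part)
    return groups

def make_file(num_partitions, records_per_partition):
    """Hypothetical input: one loaded text file."""
    return [[0] * records_per_partition for _ in range(num_partitions)]

# Hypothetical workload: 5 files, 8 partitions each, 100 records per partition.
files = [make_file(8, 100) for _ in range(5)]

# Strategy A: union everything first, then repartition once.
combined = files[0]
for f in files[1:]:
    combined = union(combined, f)
partitions_before = len(combined)            # 40: union added them all up
once, shuffled_once = repartition(combined, 4)

# Strategy B: repartition after each union.
acc = files[0]
shuffled_each = 0
for f in files[1:]:
    acc, moved = repartition(union(acc, f), 4)
    shuffled_each += moved

merged = coalesce(combined, 4)               # 4 partitions, no records shuffled

print(partitions_before)  # 40
print(shuffled_once)      # 4000
print(shuffled_each)      # 11200
```

Under these assumptions both strategies end with 4 partitions, but repartitioning after each union moves the earlier records again on every step, which is why a single repartition (or coalesce) on the final union is the cheaper choice in this model.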

> On Jun 21, 2016, at 8:06 AM, Eugene Morozov <evgeny.a.morozov@gmail.com> wrote:
> 
> Apurva, 
> 
> I'd say you have to apply repartition just once, to the RDD that is the union of all your files.
> And it has to be done right before you do anything else.
> 
> If something is not needed on your files, then the sooner you project, the better.
> 
> Hope this helps.
> 
> --
> Be well!
> Jean Morozov
> 
> On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan <apurva3000@gmail.com> wrote:
> Hello,
> 
> I am trying to combine several small text files (each file is roughly hundreds of MB to 2-3 GB) into one big Parquet file.
> 
> I am loading each one of them and trying to take a union; however, this leads to an enormous number of partitions, as union keeps adding the partitions of the input RDDs together.
> 
> I also tried loading all the files via a wildcard, but that behaves almost the same as union, i.e., it generates a lot of partitions.
> 
> One approach I thought of was to repartition the RDD generated after each union and then continue the process, but I don't know how efficient that is.
> 
> Has anyone come across this kind of thing before?
> 
> - Apurva 
> 
> 
> 

