spark-user mailing list archives

From Eugene Morozov <evgeny.a.moro...@gmail.com>
Subject Re: Union of multiple RDDs
Date Tue, 21 Jun 2016 15:06:43 GMT
Apurva,

I'd say you have to apply repartition just once, to the RDD that is the
union of all your files.
And it has to be done right before you do anything else with that RDD.
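
As a rough sketch of this idea in PySpark: the function name, `paths` and
`num_partitions` below are made up for illustration, and `sc` is assumed to
be an already created SparkContext. It is not code from the thread, only one
way the advice could look.

```python
def union_and_repartition_once(sc, paths, num_partitions):
    """Union one RDD per input file, then repartition the result a single time.

    sc             -- an existing pyspark.SparkContext (assumed to be created elsewhere)
    paths          -- list of input file paths (hypothetical parameter)
    num_partitions -- desired partition count of the result (hypothetical parameter)
    """
    # One RDD per text file; each brings its own partitions.
    rdds = [sc.textFile(path) for path in paths]

    # union() just concatenates the partition lists of its inputs,
    # which is why the partition count grows with every file.
    combined = sc.union(rdds)

    # A single shuffle on the final union, instead of one shuffle
    # after every intermediate union.
    return combined.repartition(num_partitions)
```

If the only goal is a smaller partition count, `coalesce(num_partitions)` on
the union avoids a full shuffle, though the resulting partitions may be
uneven in size.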

If something is not needed on your files, then the sooner you project, the
better.
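
To illustrate projecting early, here is a hedged PySpark sketch. It assumes
the text files are tab-delimited, which the thread does not state; the names
`project_line`, `load_projected`, `keep_indices` and `sep` are hypothetical,
and `sc` is assumed to be an existing SparkContext.

```python
def project_line(line, keep_indices, sep="\t"):
    """Keep only the needed columns of one delimited text line."""
    fields = line.split(sep)
    return tuple(fields[i] for i in keep_indices)


def load_projected(sc, paths, keep_indices, num_partitions, sep="\t"):
    """Drop unneeded columns while reading, then union and repartition once."""
    # Projecting before the repartition means the shuffle moves
    # only the columns that are actually used later.
    projected = [
        sc.textFile(path).map(lambda line: project_line(line, keep_indices, sep))
        for path in paths
    ]
    return sc.union(projected).repartition(num_partitions)
```

For example, `project_line("a\tb\tc", [0, 2])` keeps the first and third
fields of the line.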

Hope this helps.

--
Be well!
Jean Morozov

On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan <apurva3000@gmail.com> wrote:

> Hello,
>
> I am trying to combine several small text files (each file is approx
> hundreds of MBs to 2-3 gigs) into one big parquet file.
>
> I am loading each one of them and trying to take a union, however this
> leads to an enormous number of partitions, as union keeps on adding the
> partitions of the input RDDs together.
>
> I also tried loading all the files via wildcard, but that behaves almost
> the same as union i.e. generates a lot of partitions.
>
> One of the approaches I thought of was to repartition the RDD generated
> after each union and then continue the process, but I don't know how
> efficient that is.
>
> Has anyone come across this kind of thing before?
>
> - Apurva
>
>
>
