spark-user mailing list archives

From Richard Marscher <rmarsc...@localytics.com>
Subject Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?
Date Tue, 08 Sep 2015 21:24:54 GMT
Hi,

What is the reasoning behind the use of `coalesce(1, false)`? It tells Spark
to collapse all of the data into a single partition, which must then be
handled by one task on one node in the cluster. If the cluster has more than
one node, the data has to be moved across the network to reach that node. It
doesn't seem like the `map` or `union` that follows needs the coalesce, but
the use case is not clear to me.
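
To make the point above concrete, here is a toy model in plain Java. It is
only a sketch and does not use Spark: a "partition" is an ordinary list, one
task is assumed to process one partition, and the names `CoalesceSketch`,
`coalesceToOne`, `taskCount` and `largestTask` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (no Spark): a partition is a plain list, and one task
// is assumed to process exactly one partition.
final class CoalesceSketch {

    // Mimics the effect of coalescing to one partition: every input
    // partition is folded into a single list, so one task receives all rows.
    static <T> List<List<T>> coalesceToOne(List<List<T>> partitions) {
        List<T> merged = new ArrayList<>();
        for (List<T> partition : partitions) {
            merged.addAll(partition);
        }
        List<List<T>> result = new ArrayList<>();
        result.add(merged);
        return result;
    }

    // In this model, the number of tasks equals the number of partitions.
    static <T> int taskCount(List<List<T>> partitions) {
        return partitions.size();
    }

    // Number of rows the busiest task has to handle.
    static <T> int largestTask(List<List<T>> partitions) {
        int max = 0;
        for (List<T> partition : partitions) {
            max = Math.max(max, partition.size());
        }
        return max;
    }

    public static void main(String[] args) {
        // Four partitions of 250 rows each.
        List<List<Integer>> parts = new ArrayList<>();
        for (int p = 0; p < 4; p++) {
            List<Integer> rows = new ArrayList<>();
            for (int i = 0; i < 250; i++) {
                rows.add(p * 250 + i);
            }
            parts.add(rows);
        }
        List<List<Integer>> one = coalesceToOne(parts);
        System.out.println("without coalesce: tasks=" + taskCount(parts)
                + ", largest task=" + largestTask(parts));
        System.out.println("after coalescing to 1: tasks=" + taskCount(one)
                + ", largest task=" + largestTask(one));
    }
}
```

In this model the `map` step after the coalesce runs as a single task over
every row, instead of as one task per original partition.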

On Fri, Sep 4, 2015 at 12:29 PM, unk1102 <umesh.kacha@gmail.com> wrote:

> Hi, I have a Spark job that does some processing on ORC data and stores
> ORC data back using the DataFrameWriter save() API introduced in Spark
> 1.4.0. The following piece of code uses a lot of shuffle memory. How do I
> optimize it? Is there anything wrong with it? It works as expected, but it
> is slow because of GC pauses, and it shuffles so much data that it hits
> memory issues. Please guide me; I am new to Spark. Thanks in advance.
>
> JavaRDD<Row> updatedDsqlRDD = orderedFrame.toJavaRDD()
>     .coalesce(1, false)
>     .map(new Function<Row, Row>() {
>         @Override
>         public Row call(Row row) throws Exception {
>             List rowAsList;
>             Row row1 = null;
>             if (row != null) {
>                 rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
>                 row1 = RowFactory.create(rowAsList.toArray());
>             }
>             return row1;
>         }
>     })
>     .union(modifiedRDD);
>
> DataFrame updatedDataFrame =
>     hiveContext.createDataFrame(updatedDsqlRDD, renamedSourceFrame.schema());
>
> updatedDataFrame.write()
>     .mode(SaveMode.Append)
>     .format("orc")
>     .partitionBy("entity", "date")
>     .save("baseTable");


-- 
*Richard Marscher*
Software Engineer
Localytics
