spark-user mailing list archives

From unk1102 <>
Subject Spark rdd.mapPartitionsWithIndex() hits physical memory limit after huge data shuffle
Date Wed, 09 Sep 2015 19:37:43 GMT
Hi, I have the following Spark code which involves a huge data shuffle even
though I am using mapPartitionsWithIndex() with preservesPartitioning set to
false. I have 2 TB of skewed data to process; I then convert the RDD into a
DataFrame and use it as a table in hiveContext.sql(). I am using 60 executors
with 20 GB memory and 4 cores each. I set spark.yarn.executor.memoryOverhead
to 8500, which should be high enough. I am using the default setting for
spark.shuffle.memoryFraction, and I also tried changing it, but nothing
helped. I am using Spark 1.4.0. Please guide me, I am new to Spark; help me
optimize the following code. Thanks in advance.
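For reference, the executor settings described above correspond to a
spark-submit invocation along these lines (the flags are standard
Spark-on-YARN options; the application class and jar names are placeholders,
not from the original post):

```shell
# Placeholders: replace the class and jar with your application's.
spark-submit \
  --master yarn-cluster \
  --num-executors 60 \
  --executor-memory 20G \
  --executor-cores 4 \
  --conf spark.yarn.executor.memoryOverhead=8500 \
  --class com.example.MyApp \
  my-app.jar
```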

 JavaRDD<Row> indexedRdd = sourceRdd.cache().mapPartitionsWithIndex(
         new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
     public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator)
             throws Exception {
         List<Row> rowList = new ArrayList<>();
         while (rowIterator.hasNext()) {
             Row row = rowIterator.next();
             List rowAsList = ...; // per-row rebuild elided in the original post
             Row updatedRow = RowFactory.create(rowAsList.toArray());
             rowList.add(updatedRow);
         }
         return rowList.iterator();
     }
 }, false).union(remainingRdd);
 DataFrame baseFrame = ...; // DataFrame creation elided in the original post
 hiveContext.sql("insert into abc bla bla using baseTable group by bla ...");
 hiveContext.sql("insert into def bla bla using baseTable group by bla ...");
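One detail worth noting: the rowList buffer in call() materializes an entire
partition in memory before returning, which on skewed 2 TB data is a common
way to blow past the YARN physical memory limit. A minimal sketch of the
usual alternative, assuming the per-row rebuild can be applied one row at a
time, is to return a wrapping iterator that transforms rows lazily (the class
and method names below are illustrative, not from the original post, and
plain java.util types stand in for Spark's Row):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Sketch: wrap the incoming iterator instead of copying every element
// into a list. Each element is transformed on demand in next(), so the
// partition never has to fit in memory all at once. In Spark's
// mapPartitionsWithIndex you would return such an iterator from call().
public class LazyPartitionMap {
    public static <A, B> Iterator<B> lazyMap(final Iterator<A> in,
                                             final Function<A, B> f) {
        return new Iterator<B>() {
            public boolean hasNext() { return in.hasNext(); }
            public B next() { return f.apply(in.next()); }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    public static void main(String[] args) {
        // Toy stand-in for a partition of rows.
        List<String> rows = Arrays.asList("a", "b", "c");
        Iterator<String> out = lazyMap(rows.iterator(), s -> s.toUpperCase());
        while (out.hasNext()) {
            System.out.println(out.next());
        }
    }
}
```

With this shape, only one row is live at a time, and the sourceRdd.cache()
call (which also pins partitions in memory) may no longer be needed if the
RDD is consumed only once.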

