hadoop-mapreduce-dev mailing list archives

From "Igor Dvorzhak (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-7185) Parallelize part files move in FileOutputCommitter
Date Mon, 11 Feb 2019 22:53:00 GMT
Igor Dvorzhak created MAPREDUCE-7185:
----------------------------------------

             Summary: Parallelize part files move in FileOutputCommitter
                 Key: MAPREDUCE-7185
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7185
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 2.9.2, 3.2.0
            Reporter: Igor Dvorzhak
         Attachments: MAPREDUCE-7185.patch

If a map task outputs multiple files, moving them from the temporary directory to the output
directory can be slow on object stores.

To improve performance, FileOutputCommitter should move the files in parallel whenever there is more than one file to move.
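
A minimal sketch of the idea, not the code in the attached patch: it assumes plain java.nio file moves on a local file system as a stand-in for the committer's renames, and the class name ParallelMoveSketch, the method moveAll and the thread-count parameter are hypothetical names chosen for illustration. Each file move is submitted to a fixed-size thread pool, and the caller waits for all of them to finish.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

// Hypothetical illustration only; this is NOT FileOutputCommitter and NOT the patch.
class ParallelMoveSketch {

    // Moves every regular file directly under srcDir into dstDir, running up to
    // `threads` moves concurrently. Returns the number of files moved.
    static int moveAll(Path srcDir, Path dstDir, int threads)
            throws IOException, InterruptedException {
        Files.createDirectories(dstDir);

        // Collect the files first so the listing is closed before moving starts.
        List<Path> files = new ArrayList<>();
        try (Stream<Path> listing = Files.list(srcDir)) {
            listing.filter(Files::isRegularFile).forEach(files::add);
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path file : files) {
                // Each move is independent of the others, so it can run on its own worker.
                Callable<Path> move = () -> Files.move(
                        file,
                        dstDir.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                pending.add(pool.submit(move));
            }
            // Wait for all moves and surface the first failure as an IOException.
            for (Future<Path> result : pending) {
                try {
                    result.get();
                } catch (ExecutionException e) {
                    throw new IOException("Move failed", e.getCause());
                }
            }
            return files.size();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

On an object store each move is typically dominated by request latency, so overlapping the requests is where a speedup would be expected to come from; a suitable pool size depends on the store and is left as a parameter here.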

Repro:
Start spark-shell:
{code:bash}
spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=2
{code}
From spark-shell:
{code:scala}
val df = (1 to 10000).toList.toDF("value").withColumn("p", $"value" % 10).repartition(50)
df.write.partitionBy("p").mode("overwrite").format("parquet").options(Map("path" -> s"gs://some/path")).saveAsTable("parquet_partitioned_bench")
{code}

With the fix, the execution time of this repro drops from 130 seconds to 50 seconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
