spark-issues mailing list archives

From "Tian Tian (Jira)" <>
Subject [jira] [Commented] (SPARK-6221) SparkSQL should support auto merging output files
Date Mon, 23 Dec 2019 03:21:00 GMT


Tian Tian commented on SPARK-6221:

I encountered this problem and found SPARK-24940.

Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in a Spark SQL query
will control the number of output files.

In my practice I recommend the second hint (REPARTITION) for users, because it generates a new
stage to do this job, while the first (COALESCE) does not, which may stall the job because of
too few tasks in the last stage.
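A minimal sketch of both hints (these hints were added by SPARK-24940, so Spark 2.4+ is assumed; the table and column names are hypothetical):

```sql
-- REPARTITION(10) inserts a shuffle stage, so the aggregation keeps its
-- full parallelism and only the final write is limited to 10 files
INSERT INTO daily_table
SELECT /*+ REPARTITION(10) */ user_id, count(*) AS cnt
FROM minute_data
GROUP BY user_id;

-- COALESCE(10) avoids the extra shuffle but shrinks the whole last stage
-- to 10 tasks, which can slow or stall the job on large inputs
INSERT INTO daily_table
SELECT /*+ COALESCE(10) */ user_id, count(*) AS cnt
FROM minute_data
GROUP BY user_id;
```

The trade-off is the same as with the DataFrame `repartition`/`coalesce` methods: REPARTITION pays for one extra shuffle to keep upstream parallelism, COALESCE saves the shuffle at the cost of last-stage parallelism.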

> SparkSQL should support auto merging output files
> -------------------------------------------------
>                 Key: SPARK-6221
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tianyi Wang
>            Priority: Major
> Hive has a feature that can automatically merge small files in an HQL query's output path.
> This feature is quite useful when people use {{insert into}} to load minute-level data from
> the input path into a daily table.
> In that case, if the SQL includes a {{group by}} or {{join}} operation, we always set the
> {{reduce number}} to at least 200 to avoid possible OOM on the reduce side.
> That causes the SQL to output at least 200 files at the end of the execution, so the
> daily table will eventually contain more than 50000 files.
> If we could provide the same feature in SparkSQL, it would greatly reduce HDFS operations
> and Spark tasks when we run other SQL on this table.

This message was sent by Atlassian Jira
