spark-issues mailing list archives

From "Yi Tian (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-6221) SparkSQL should support auto merging
Date Mon, 09 Mar 2015 06:48:38 GMT
Yi Tian created SPARK-6221:
------------------------------

             Summary: SparkSQL should support auto merging
                 Key: SPARK-6221
                 URL: https://issues.apache.org/jira/browse/SPARK-6221
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Yi Tian


Hive can automatically merge small files in the output path of an HQL query.
This feature is useful when people use {{insert into}} to load minute-level data from an
input path into a daily table.
In that case, if the SQL includes a {{group by}} or {{join}} operation, we always set the {{reduce
number}} to at least 200 to avoid a possible OOM on the reduce side.
As a result, each execution of the SQL writes at least 200 output files, so the daily
table eventually contains more than 50000 files.
If SparkSQL provided the same feature, it would greatly reduce the HDFS operations
and Spark tasks needed when we run other SQL on this table.
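
To make the file-count problem concrete, here is a minimal sketch in plain Python. It is not Hive's or Spark's implementation. The 5-minute load interval, the 1 MB file size, the 128 MB target size, and the helper name {{plan_merge}} are all assumptions chosen for illustration; only the 200 reducers come from the description above.

```python
from typing import List

MB = 1024 * 1024


def plan_merge(file_sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily group files, in order, so each group's total size stays at or
    below target_size. A single file larger than target_size gets its own group."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in file_sizes:
        # Close the current group if adding this file would exceed the target.
        if current and current_total + size > target_size:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups


# Assumption: one insert every 5 minutes, each writing 200 reducer output files.
runs_per_day = 24 * 60 // 5
files_per_run = 200
files_per_day = runs_per_day * files_per_run
print("files per day without merging:", files_per_day)

# Assumption: each reducer output is 1 MB and the merge target is 128 MB.
small_files = [1 * MB] * files_per_run
merged = plan_merge(small_files, 128 * MB)
print("files per run after merging:", len(merged))
```

Under these assumed numbers the daily table would receive 57600 files without merging, while a merge step would reduce each run from 200 files to 2. The real saving depends on the actual file sizes and the chosen target size.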





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



Mime
View raw message