spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
Date Thu, 17 Dec 2015 06:35:46 GMT
Reynold Xin created SPARK-12394:
-----------------------------------

             Summary: Support writing out pre-hash-partitioned data and exploit that in join
optimizations to avoid shuffle (i.e. bucketing in Hive)
                 Key: SPARK-12394
                 URL: https://issues.apache.org/jira/browse/SPARK-12394
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Reynold Xin


In many cases users know ahead of time the columns that they will be joining or aggregating
on.  Ideally they should be able to leverage this information and pre-shuffle the data so
that subsequent queries do not require a shuffle.  Hive supports this functionality by allowing
the user to define buckets, which are hash partitioning of the data based on some key.

 - Allow the user to specify a set of columns when caching or writing out data
 - Allow the user to specify some parallelism
 - Shuffle the data when writing / caching such that its distributed by these columns
 - When planning/executing  a query, use this distribution to avoid another shuffle when reading,
assuming the join or aggregation is compatible with the columns specified
 - Should work with existing save modes: append, overwrite, etc
 - Should work at least with all Hadoops FS data sources
 - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message