spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yin Huai (JIRA)" <>
Subject [jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
Date Fri, 09 Jan 2015 22:44:34 GMT


Yin Huai commented on SPARK-5182:

Here is the doc from [~marmbrus].

Partitioning data by one or more columns is a very important optimization for many analytic
workloads.  Right now, the implementation of partitioning in the Data Sources API suffers
from several shortcomings.
First, each data source must implement the support on its own leading to code duplication.
 This duplication applies both to the code of discovering / cataloging partitions, but also
to the code required to evaluate predicates against a given partitions. 
Second, only a limited set of predicates are pushed down and so partitioning misses opportunities
to prune.  While we can continue to expand the set of filters, however, this does not solve
the problem that each data source would still need to implement its own version of expression
evaluation for each new (Filter x DataType).

Requirements for the new API:
* Built in support for telling a data source which partitions it should read based on arbitrary
predicates (including things like UDFS).
* Support for multiple levels of nested directories that store data based on partitioning
attributes (e.g, /table/col1=a/col2=b).
* Rapid auto-discovery of large numbers of partitions.
* Discovery of partition column types using schema inference similar to JSON.
* Support for user defined partitioning schemes? (i.e. /table/2001/02/03)

Proposed interface:
case class Partition(values: Row, path: String)
case class PartitionSpec(
    partitionColumns: StructType, 
    partitions: Array[Partition])

class PartitionedRelation {
  // Has default implementation
  def parsePartitions(paths: Array[String]): PartitionSpec 

  def basePath: String

  def buildScan(
      partitions: Array[Partition], 
      requiredColumns: Array[String], 
      filters: Array[Filter]): RDD[Row]
Open Questions:
* Is it okay to store all of the partition metadata in-memory initially? Or should we consider
storing this data locally to something like BDB?
* Should we be using metastore partitioning instead?

> Partitioning support for tables created by the data source API
> --------------------------------------------------------------
>                 Key: SPARK-5182
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Blocker

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message