spark-issues mailing list archives

From "Yash Datta (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-6632) Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself
Date Tue, 31 Mar 2015 13:23:52 GMT
Yash Datta created SPARK-6632:
---------------------------------

             Summary: Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself
                 Key: SPARK-6632
                 URL: https://issues.apache.org/jira/browse/SPARK-6632
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Yash Datta
             Fix For: 1.4.0


Currently, ParquetRelation2 first merges the schemas of all the part files and then reconciles
the merged schema with the metastore schema. This approach does not scale when the table has
thousands of partitions. Instead, we can start from the metastore schema and reconcile the
column names within each map task, using the ReadSupport hooks provided by Parquet.
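
To illustrate the per-task idea, here is a minimal sketch in plain Python. It is not Spark or Parquet code. It assumes that reconciliation means matching column names case-insensitively while keeping the metastore's types, which this report does not spell out. The function name `reconcile_with_file` and the `(name, type)` tuples are hypothetical. Each task would run this against the footer schema of the one part file it reads, so no global merge over all part files is needed.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a schema field: (column name, type name).
Field = Tuple[str, str]


def reconcile_with_file(metastore_schema: List[Field],
                        file_schema: List[Field]) -> List[Field]:
    """Return the metastore schema with names spelled as in one part file.

    Assumption: names are matched case-insensitively and the metastore
    type is kept as-is.
    """
    # Index this part file's column names by their lower-cased form.
    by_lower: Dict[str, str] = {name.lower(): name for name, _ in file_schema}

    reconciled: List[Field] = []
    for name, dtype in metastore_schema:
        # Use the file's spelling when the column exists in this file;
        # otherwise keep the metastore name (column absent from the file).
        reconciled.append((by_lower.get(name.lower(), name), dtype))
    return reconciled


# Example: the metastore stores lower-cased names, the file keeps its case.
metastore = [("userid", "bigint"), ("eventtime", "timestamp"), ("country", "string")]
part_file = [("userId", "bigint"), ("eventTime", "timestamp")]

print(reconcile_with_file(metastore, part_file))
```

In this sketch the cost per task depends only on the number of columns in a single file, not on the number of partitions in the table.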


