spark-dev mailing list archives

From: Cheng Lian <lian.cs....@gmail.com>
Subject: Re: Spark SQL, Hive & Parquet data types
Date: Tue, 24 Feb 2015 03:08:45 GMT
Ah, sorry for not being clear enough.

So now in Spark 1.3.0, we have two Parquet support implementations: the 
old one is tightly coupled with the Spark SQL framework, while the new 
one is based on the data sources API. In both versions, we try to 
intercept operations over Parquet tables registered in the metastore 
whenever possible, for better performance (mainly filter push-down 
optimization and extra metadata for more accurate schema inference). 
The distinctions are:

 1. For the old version (set |spark.sql.parquet.useDataSourceApi| to
    |false|):

    When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
    “hijack” the read path. Namely, whenever you query a Parquet table
    registered in the metastore, we use our own Parquet implementation.

    For the write path, we fall back to the default Hive SerDe
    implementation (namely Spark SQL’s |InsertIntoHiveTable| operator).
 2. For the new data source version (set
    |spark.sql.parquet.useDataSourceApi| to |true|, which is the default
    value in master and branch-1.3):

    When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
    “hijack” both the read and the write path, but if you’re writing to
    a partitioned table, we still fall back to the default Hive SerDe
    implementation (see the configuration sketch after this list).
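
In spark-shell terms, switching between the two implementations looks 
roughly like this (a minimal sketch; only the two config keys come from 
this thread, everything else is illustrative):

    // Sketch: toggling the two Parquet support implementations on a HiveContext
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)

    // Intercept operations over Parquet tables registered in the metastore
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

    // Old implementation (item 1): read path hijacked, writes fall back to Hive SerDe
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

    // New data source implementation (item 2, the default in master / branch-1.3):
    // both read and write paths hijacked, except writes to partitioned tables
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "true")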

For Spark 1.2.0, only item 1 applies. Spark 1.2.0 also has a Parquet 
data source, but it’s only enabled when you use the data sources API 
specific DDL (|CREATE TEMPORARY TABLE <table-name> USING <data-source>|).
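
For example, that DDL looks like this (table name and path are 
illustrative):

    // Registers a Parquet file as a table through the data sources API
    sqlContext.sql("""
      CREATE TEMPORARY TABLE parquet_table
      USING org.apache.spark.sql.parquet
      OPTIONS (path "/path/to/data.parquet")
    """)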

Cheng

On 2/23/15 10:05 PM, The Watcher wrote:

>> Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
>> own Parquet support to read partitioned Parquet tables declared in the Hive
>> metastore. Only writing to partitioned tables is not covered yet. These
>> improvements will be included in Spark 1.3.0.
>>
>> Just created SPARK-5948 to track writing to partitioned Parquet tables.
>>
> Ok, this is still a little confusing.
>
> Since in 1.2.0 I am able to write to a partitioned Hive table by
> registering my SchemaRDD and calling INSERT INTO "the Hive partitioned
> table" SELECT "the registered one", what is the write path in this case?
> Full Hive with a SparkSQL <-> Hive bridge?
> If that were the case, why wouldn't SKEWED ON be honored (see another
> thread I opened)?
>
> Thanks
>
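
For reference, the 1.2.0 write path described in the quoted question 
looks roughly like this (a minimal sketch; paths, table names, and the 
partition column are illustrative):

    // Spark 1.2 style: register a SchemaRDD, then INSERT INTO a partitioned Hive table
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Load Parquet data as a SchemaRDD and register it as a temporary table
    val schemaRDD = hiveContext.parquetFile("/path/to/input.parquet")
    schemaRDD.registerTempTable("staging")

    // This INSERT goes through Spark SQL's InsertIntoHiveTable operator,
    // i.e. the default Hive SerDe write path mentioned in item 1 above
    hiveContext.sql(
      "INSERT INTO TABLE partitioned_table PARTITION (dt = '2015-02-23') " +
      "SELECT * FROM staging")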
