spark-user mailing list archives

From Christian Perez
Subject Re: saveAsTable broken in v1.3 DataFrames?
Date Thu, 19 Mar 2015 16:34:22 GMT
Hi Yin,

Thanks for the clarification. My first reaction is that if this is the
intended behavior, it is a wasted opportunity. Why create a managed
table in Hive that cannot be read from inside Hive? I think I
understand now that you are essentially piggybacking on Hive's
metastore to persist table info between/across sessions, but I imagine
others might expect more (as I did).

We find ourselves wanting to do work in Spark and persist the results
where other users (e.g. analysts using Tableau connected to
Hive/Impala) can explore it. I imagine this is very common. I can, of
course, save it as Parquet and create an external table in Hive (which
I will do now), but saveAsTable seems much less useful to me now.
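
For reference, the workaround described above (write Parquet files, then
point an external Hive table at them) might look roughly like the sketch
below. It is only an illustration under stated assumptions: the helper
name external_parquet_ddl and the HDFS path are hypothetical, the column
types are copied from the "Expected" schema in the original report, and
they must match whatever Spark actually wrote to the files. The Spark
call is shown as a comment because it needs a live session.

```python
# Hypothetical helper: builds the Hive DDL for an external table over
# Parquet files that Spark has already written. Names and paths are
# illustrative, not taken from the thread.

def external_parquet_ddl(table, columns, location):
    """Return a CREATE EXTERNAL TABLE statement (Hive 0.13+ syntax).

    columns is a list of (column_name, hive_type) pairs.
    """
    if not columns:
        raise ValueError("at least one column is required")
    cols = ",\n".join(
        "  `{0}` {1}".format(name, hive_type) for name, hive_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE `{0}` (\n{1}\n)\n"
        "STORED AS PARQUET\n"
        "LOCATION '{2}'"
    ).format(table, cols, location)


# Step 1 (in pyspark 1.3, not executed here): write the DataFrame as Parquet.
#   df.saveAsParquetFile("/user/hive/external/spark_test_foo")
# Step 2: run the generated statement in Hive (beeline, Hue, ...).
ddl = external_parquet_ddl(
    "spark_test_foo",
    [("education", "BIGINT"), ("income", "BIGINT")],
    "/user/hive/external/spark_test_foo",
)
print(ddl)
```

Because the table is external, dropping it in Hive leaves the Parquet
files in place, so Spark can keep reading them directly.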

Any other opinions?



On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai wrote:
> I meant table properties and serde properties are used to store metadata of
> a Spark SQL data source table. We do not set other fields like SerDe lib.
> For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table
> should not show unrelated stuff like Serde lib and InputFormat. I have
> created to track the
> improvement on the output of DESCRIBE statement.
> On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai wrote:
>> Hi Christian,
>> Your table is stored correctly in Parquet format.
>> For saveAsTable, the table created is not a Hive table, but a Spark SQL
>> data source table
>> (
>> We are only using Hive's metastore to store the metadata (to be specific,
>> only table properties and serde properties). When you look at table
>> property, there will be a field called "spark.sql.sources.provider" and the
>> value will be "org.apache.spark.sql.parquet.DefaultSource". You can also
>> look at your files in the file system. They are stored by Parquet.
>> Thanks,
>> Yin
>> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez wrote:
>>> Hi all,
>>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>>> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
>>> schema _and_ storage format in the Hive metastore, so that the table
>>> cannot be read from inside Hive. Spark itself can read the table, but
>>> Hive throws a Serialization error because it doesn't know it is
>>> Parquet.
>>> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education",
>>> "income")
>>> df.saveAsTable("spark_test_foo")
>>> Expected:
>>>   education BIGINT,
>>>   income BIGINT
>>> )
>>> SerDe Library:
>>> InputFormat:
>>> Actual:
>>>   col array<string> COMMENT "from deserializer"
>>> )
>>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
>>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>> ---
>>> Manually changing schema and storage restores access in Hive and
>>> doesn't affect Spark. Note also that Hive's table property
>>> "spark.sql.sources.schema" is correct. At first glance, it looks like
>>> the schema data is serialized when sent to Hive but not deserialized
>>> properly on receive.
>>> I'm tracing execution through source code... but before I get any
>>> deeper, can anyone reproduce this behavior?
>>> Cheers,
>>> Christian
>>> --
>>> Christian Perez
>>> Silicon Valley Data Science
>>> Data Analyst
>>> @cp_phd

Christian Perez
Silicon Valley Data Science
Data Analyst