spark-user mailing list archives

From Christian Perez <christ...@svds.com>
Subject Re: saveAsTable broken in v1.3 DataFrames?
Date Thu, 19 Mar 2015 16:34:22 GMT
Hi Yin,

Thanks for the clarification. My first reaction is that if this is the
intended behavior, it is a missed opportunity. Why create a managed
table in Hive that cannot be read from inside Hive? I think I
understand now that you are essentially piggybacking on Hive's
metastore to persist table metadata across sessions, but I imagine
others might expect more (as I did).

We find ourselves wanting to do work in Spark and persist the results
where other users (e.g. analysts using Tableau connected to
Hive/Impala) can explore them. I imagine this is very common. I can, of
course, save the data as Parquet and create an external table in Hive
(which I will do now), but saveAsTable seems much less useful to me now.
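
For anyone following along, here is a rough sketch of the second half of
that workaround: generating the Hive DDL for an external table over the
Parquet files. It assumes the DataFrame has already been written out (for
example with df.saveAsParquetFile(path) in 1.3) and that Hive is 0.13 or
later, where STORED AS PARQUET is available. The helper name, the table
name and the path are made up for illustration, and the Hive column types
must match whatever types Spark actually wrote to the files.

```python
# Sketch only: builds a CREATE EXTERNAL TABLE statement as a string.
# external_parquet_ddl, the table name and the location are hypothetical.

def external_parquet_ddl(table, columns, location):
    """Return Hive DDL for an external Parquet table.

    columns is a list of (name, hive_type) pairs in file order.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # One backquoted column definition per line, comma separated.
    cols = ",\n".join(
        "  `{0}` {1}".format(name, hive_type) for name, hive_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE `{0}` (\n{1}\n)\n"
        "STORED AS PARQUET\n"
        "LOCATION '{2}'"
    ).format(table, cols, location)


ddl = external_parquet_ddl(
    "spark_test_foo_ext",  # hypothetical table name
    [("education", "BIGINT"), ("income", "BIGINT")],  # must match the files
    "/tmp/spark_test_foo_parquet",  # hypothetical directory written by Spark
)
print(ddl)
```

The resulting statement can be run from the Hive CLI or passed to
HiveContext.sql. Because the table is external, dropping it in Hive
leaves the Parquet files in place.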

Any other opinions?

Cheers,

C

On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai <yhuai@databricks.com> wrote:
> I meant table properties and serde properties are used to store metadata of
> a Spark SQL data source table. We do not set other fields like SerDe lib.
> For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table
> should not show unrelated stuff like Serde lib and InputFormat. I have
> created https://issues.apache.org/jira/browse/SPARK-6413 to track the
> improvement on the output of DESCRIBE statement.
>
> On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yhuai@databricks.com> wrote:
>>
>> Hi Christian,
>>
>> Your table is stored correctly in Parquet format.
>>
>> For saveAsTable, the table created is not a Hive table, but a Spark SQL
>> data source table
>> (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
>> We are only using Hive's metastore to store the metadata (to be specific,
>> only table properties and serde properties). When you look at the table
>> properties, there will be a field called "spark.sql.sources.provider" and the
>> value will be "org.apache.spark.sql.parquet.DefaultSource". You can also
>> look at your files in the file system. They are stored as Parquet.
>>
>> Thanks,
>>
>> Yin
>>
>> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christian@svds.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>>> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
>>> schema _and_ storage format in the Hive metastore, so that the table
>>> cannot be read from inside Hive. Spark itself can read the table, but
>>> Hive throws a Serialization error because it doesn't know it is
>>> Parquet.
>>>
>>> val df = sc.parallelize(Array((1, 2), (3, 4))).toDF("education", "income")
>>> df.saveAsTable("spark_test_foo")
>>>
>>> Expected:
>>>
>>> COLUMNS(
>>>   education BIGINT,
>>>   income BIGINT
>>> )
>>>
>>> SerDe Library:
>>> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>>> InputFormat:
>>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>>
>>> Actual:
>>>
>>> COLUMNS(
>>>   col array<string> COMMENT "from deserializer"
>>> )
>>>
>>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
>>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>>
>>> ---
>>>
>>> Manually changing schema and storage restores access in Hive and
>>> doesn't affect Spark. Note also that Hive's table property
>>> "spark.sql.sources.schema" is correct. At first glance, it looks like
>>> the schema data is serialized when sent to Hive but not deserialized
>>> properly on receive.
>>>
>>> I'm tracing execution through source code... but before I get any
>>> deeper, can anyone reproduce this behavior?
>>>
>>> Cheers,
>>>
>>> Christian
>>>
>>> --
>>> Christian Perez
>>> Silicon Valley Data Science
>>> Data Analyst
>>> christian@svds.com
>>> @cp_phd
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>
>



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christian@svds.com
@cp_phd
