spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <mich...@databricks.com>
Subject Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)
Date Mon, 08 Dec 2014 19:34:47 GMT
This is by hive's design.  From the Hive documentation:

The column change command will only modify Hive's metadata, and will not
> modify data. Users should make sure the actual data layout of the
> table/partition conforms with the metadata definition.



On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <jianshi.huang@gmail.com>
wrote:

> Ok, found another possible bug in Hive.
>
> My current solution is to use ALTER TABLE CHANGE to rename the column
> names.
>
> The problem is after renaming the column names, the value of the columns
> became all NULL.
>
> Before renaming:
> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>
> Execute renaming:
> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
> res13: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[972] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> <Native command: executed by Hive>
>
> After renaming:
> scala> sql("select cre_ts from pmt limit 1").collect
> res16: Array[org.apache.spark.sql.Row] = Array([null])
>
> I created a JIRA for it:
>
>   https://issues.apache.org/jira/browse/SPARK-4781
>
>
> Jianshi
>
> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
>
>> Hmm... another issue I found doing this approach is that ANALYZE TABLE
>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>> later broadcast join and such will fail...
>>
>> Any idea how to fix this issue?
>>
>> Jianshi
>>
>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <jianshi.huang@gmail.com>
>> wrote:
>>
>>> Very interesting, the line doing drop table will throws an exception.
>>> After removing it all works.
>>>
>>> Jianshi
>>>
>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>> wrote:
>>>
>>>> Here's the solution I got after talking with Liancheng:
>>>>
>>>> 1) using backquote `..` to wrap up all illegal characters
>>>>
>>>>     val rdd = parquetFile(file)
>>>>     val schema = rdd.schema.fields.map(f => s"`${f.name}`
>>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>>
>>>>     val ddl_13 = s"""
>>>>       |CREATE EXTERNAL TABLE $name (
>>>>       |  $schema
>>>>       |)
>>>>       |STORED AS PARQUET
>>>>       |LOCATION '$file'
>>>>       """.stripMargin
>>>>
>>>>     sql(ddl_13)
>>>>
>>>> 2) create a new Schema and do applySchema to generate a new SchemaRDD,
>>>> had to drop and register table
>>>>
>>>>     val t = table(name)
>>>>     val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
>>>> s.name.replaceAll(".*?::", ""))))
>>>>     sql(s"drop table $name")
>>>>     applySchema(t, newSchema).registerTempTable(name)
>>>>
>>>> I'm testing it for now.
>>>>
>>>> Thanks for the help!
>>>>
>>>>
>>>> Jianshi
>>>>
>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I had to use Pig for some preprocessing and to generate Parquet files
>>>>> for Spark to consume.
>>>>>
>>>>> However, due to Pig's limitation, the generated schema contains Pig's
>>>>> identifier
>>>>>
>>>>> e.g.
>>>>> sorted::id, sorted::cre_ts, ...
>>>>>
>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>
>>>>>   create external table pmt (
>>>>>     sorted::id bigint
>>>>>   )
>>>>>   stored as parquet
>>>>>   location '...'
>>>>>
>>>>> Obviously it didn't work, I also tried removing the identifier
>>>>> sorted::, but the resulting rows contain only nulls.
>>>>>
>>>>> Any idea how to create a table in HiveContext from these Parquet files?
>>>>>
>>>>> Thanks,
>>>>> Jianshi
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message