spark-dev mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)
Date Tue, 09 Dec 2014 05:31:24 GMT
Ah... I see. Thanks for pointing it out.

Then that means we cannot mount an external table with customized column
names. Hmm...

The only option left is to use a subquery that adds a bunch of column
aliases. I'll try that later.
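For reference, a hedged sketch of how such an aliasing subquery could be
generated (the helper and column names here are illustrative, not the actual
pmt schema):

```scala
// Hypothetical helper: wrap a table in a subquery that aliases each
// Pig-style `bag::col` name to a plain identifier Hive can accept.
object AliasSubquery {
  def aliasedSelect(table: String, cols: Seq[String]): String = {
    val aliases = cols
      .map(c => s"`$c` AS ${c.replaceAll(".*?::", "")}")  // strip "sorted::" etc.
      .mkString(", ")
    s"SELECT $aliases FROM $table"
  }

  def main(args: Array[String]): Unit =
    // prints: SELECT `sorted::id` AS id, `sorted::cre_ts` AS cre_ts FROM pmt
    println(aliasedSelect("pmt", Seq("sorted::id", "sorted::cre_ts")))
}
```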

Thanks,
Jianshi

On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust <michael@databricks.com>
wrote:

> This is by Hive's design. From the Hive documentation:
>
> The column change command will only modify Hive's metadata, and will not
>> modify data. Users should make sure the actual data layout of the
>> table/partition conforms with the metadata definition.
>
>
>
> On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
>
>> Ok, found another possible bug in Hive.
>>
>> My current solution is to use ALTER TABLE CHANGE to rename the column
>> names.
>>
>> The problem is that after renaming the columns, their values all became
>> NULL.
>>
>> Before renaming:
>> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
>> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>>
>> Execute renaming:
>> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
>> res13: org.apache.spark.sql.SchemaRDD =
>> SchemaRDD[972] at RDD at SchemaRDD.scala:108
>> == Query Plan ==
>> <Native command: executed by Hive>
>>
>> After renaming:
>> scala> sql("select cre_ts from pmt limit 1").collect
>> res16: Array[org.apache.spark.sql.Row] = Array([null])
>>
>> I created a JIRA for it:
>>
>>   https://issues.apache.org/jira/browse/SPARK-4781
>>
>>
>> Jianshi
>>
>> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <jianshi.huang@gmail.com>
>> wrote:
>>
>>> Hmm... another issue I found with this approach is that ANALYZE TABLE
>>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>>> later broadcast join and such will fail...
>>>
>>> Any idea how to fix this issue?
>>>
>>> Jianshi
>>>
>>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <jianshi.huang@gmail.com>
>>> wrote:
>>>
>>>> Very interesting: the line doing the drop table throws an exception.
>>>> After removing it, everything works.
>>>>
>>>> Jianshi
>>>>
>>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>> wrote:
>>>>
>>>>> Here's the solution I got after talking with Liancheng:
>>>>>
>>>>> 1) use backquotes `..` to wrap all illegal characters
>>>>>
>>>>>     val rdd = parquetFile(file)
>>>>>     val schema = rdd.schema.fields
>>>>>       .map(f => s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}")
>>>>>       .mkString(",\n")
>>>>>
>>>>>     val ddl_13 = s"""
>>>>>       |CREATE EXTERNAL TABLE $name (
>>>>>       |  $schema
>>>>>       |)
>>>>>       |STORED AS PARQUET
>>>>>       |LOCATION '$file'
>>>>>       """.stripMargin
>>>>>
>>>>>     sql(ddl_13)
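A minimal, Spark-free sketch of the DDL construction quoted above, to show
what the generated statement looks like (`Field` is a stand-in for Spark's
`StructField`, and the field names and Hive types are assumed):

```scala
// Stand-in for StructField; HiveMetastoreTypes is not on the classpath here,
// so the metastore type string is supplied directly (assumption).
case class Field(name: String, metastoreType: String)

object DdlSketch {
  def ddl(table: String, location: String, fields: Seq[Field]): String = {
    // Backquote each name so Pig's "::" survives Hive's parser.
    val schema = fields
      .map(f => s"`${f.name}` ${f.metastoreType}")
      .mkString(",\n  ")
    s"""CREATE EXTERNAL TABLE $table (
       |  $schema
       |)
       |STORED AS PARQUET
       |LOCATION '$location'""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(ddl("pmt", "/path/to/parquet",
      Seq(Field("sorted::id", "bigint"), Field("sorted::cre_ts", "string"))))
}
```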
>>>>>
>>>>> 2) create a new schema and call applySchema to generate a new SchemaRDD;
>>>>> I had to drop and re-register the table
>>>>>
>>>>>     val t = table(name)
>>>>>     val newSchema = StructType(t.schema.fields.map(s =>
>>>>>       s.copy(name = s.name.replaceAll(".*?::", ""))))
>>>>>     sql(s"drop table $name")
>>>>>     applySchema(t, newSchema).registerTempTable(name)
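The rename from the quoted snippet, sketched without Spark on the classpath
(`Field` stands in for `StructField`; the regex is the one used above):

```scala
// Stand-in for Spark's StructField (assumption: only the name matters here).
case class Field(name: String, dataType: String)

object RenameSketch {
  // Strip every "prefix::" that Pig prepends, e.g. "sorted::cre_ts" -> "cre_ts".
  def stripPigPrefix(fields: Seq[Field]): Seq[Field] =
    fields.map(f => f.copy(name = f.name.replaceAll(".*?::", "")))

  def main(args: Array[String]): Unit =
    // prints: List(cre_ts)
    println(stripPigPrefix(Seq(Field("sorted::cre_ts", "string"))).map(_.name))
}
```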
>>>>>
>>>>> I'm testing it for now.
>>>>>
>>>>> Thanks for the help!
>>>>>
>>>>>
>>>>> Jianshi
>>>>>
>>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <jianshi.huang@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I had to use Pig for some preprocessing and to generate Parquet files
>>>>>> for Spark to consume.
>>>>>>
>>>>>> However, due to a Pig limitation, the generated schema contains
>>>>>> Pig-style identifiers, e.g.
>>>>>>
>>>>>> sorted::id, sorted::cre_ts, ...
>>>>>>
>>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>>
>>>>>>   create external table pmt (
>>>>>>     sorted::id bigint
>>>>>>   )
>>>>>>   stored as parquet
>>>>>>   location '...'
>>>>>>
>>>>>> Obviously it didn't work. I also tried removing the sorted:: prefix,
>>>>>> but the resulting rows contained only nulls.
>>>>>>
>>>>>> Any idea how to create a table in HiveContext from these Parquet
>>>>>> files?
>>>>>>
>>>>>> Thanks,
>>>>>> Jianshi
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
