spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <mich...@databricks.com>
Subject Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)
Date Tue, 09 Dec 2014 19:37:35 GMT
You might also try out the recently added support for views.

On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang <jianshi.huang@gmail.com>
wrote:

> Ah... I see. Thanks for pointing it out.
>
> Then it means we cannot mount external table using customized column
> names. hmm...
>
> Then the only option left is to use a subquery to add a bunch of column
> alias. I'll try it later.
>
> Thanks,
> Jianshi
>
> On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> This is by hive's design.  From the Hive documentation:
>>
>> The column change command will only modify Hive's metadata, and will not
>>> modify data. Users should make sure the actual data layout of the
>>> table/partition conforms with the metadata definition.
>>
>>
>>
>> On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <jianshi.huang@gmail.com>
>> wrote:
>>
>>> Ok, found another possible bug in Hive.
>>>
>>> My current solution is to use ALTER TABLE CHANGE to rename the column
>>> names.
>>>
>>> The problem is after renaming the column names, the value of the columns
>>> became all NULL.
>>>
>>> Before renaming:
>>> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
>>> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>>>
>>> Execute renaming:
>>> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
>>> res13: org.apache.spark.sql.SchemaRDD =
>>> SchemaRDD[972] at RDD at SchemaRDD.scala:108
>>> == Query Plan ==
>>> <Native command: executed by Hive>
>>>
>>> After renaming:
>>> scala> sql("select cre_ts from pmt limit 1").collect
>>> res16: Array[org.apache.spark.sql.Row] = Array([null])
>>>
>>> I created a JIRA for it:
>>>
>>>   https://issues.apache.org/jira/browse/SPARK-4781
>>>
>>>
>>> Jianshi
>>>
>>> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>> wrote:
>>>
>>>> Hmm... another issue I found doing this approach is that ANALYZE TABLE
>>>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>>>> later broadcast join and such will fail...
>>>>
>>>> Any idea how to fix this issue?
>>>>
>>>> Jianshi
>>>>
>>>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <jianshi.huang@gmail.com>
>>>> wrote:
>>>>
>>>>> Very interesting, the line doing drop table will throws an exception.
>>>>> After removing it all works.
>>>>>
>>>>> Jianshi
>>>>>
>>>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <jianshi.huang@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Here's the solution I got after talking with Liancheng:
>>>>>>
>>>>>> 1) using backquote `..` to wrap up all illegal characters
>>>>>>
>>>>>>     val rdd = parquetFile(file)
>>>>>>     val schema = rdd.schema.fields.map(f => s"`${f.name}`
>>>>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>>>>
>>>>>>     val ddl_13 = s"""
>>>>>>       |CREATE EXTERNAL TABLE $name (
>>>>>>       |  $schema
>>>>>>       |)
>>>>>>       |STORED AS PARQUET
>>>>>>       |LOCATION '$file'
>>>>>>       """.stripMargin
>>>>>>
>>>>>>     sql(ddl_13)
>>>>>>
>>>>>> 2) create a new Schema and do applySchema to generate a new
>>>>>> SchemaRDD, had to drop and register table
>>>>>>
>>>>>>     val t = table(name)
>>>>>>     val newSchema = StructType(t.schema.fields.map(s => s.copy(name
=
>>>>>> s.name.replaceAll(".*?::", ""))))
>>>>>>     sql(s"drop table $name")
>>>>>>     applySchema(t, newSchema).registerTempTable(name)
>>>>>>
>>>>>> I'm testing it for now.
>>>>>>
>>>>>> Thanks for the help!
>>>>>>
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <
>>>>>> jianshi.huang@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I had to use Pig for some preprocessing and to generate Parquet
>>>>>>> files for Spark to consume.
>>>>>>>
>>>>>>> However, due to Pig's limitation, the generated schema contains
>>>>>>> Pig's identifier
>>>>>>>
>>>>>>> e.g.
>>>>>>> sorted::id, sorted::cre_ts, ...
>>>>>>>
>>>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>>>
>>>>>>>   create external table pmt (
>>>>>>>     sorted::id bigint
>>>>>>>   )
>>>>>>>   stored as parquet
>>>>>>>   location '...'
>>>>>>>
>>>>>>> Obviously it didn't work, I also tried removing the identifier
>>>>>>> sorted::, but the resulting rows contain only nulls.
>>>>>>>
>>>>>>> Any idea how to create a table in HiveContext from these Parquet
>>>>>>> files?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jianshi
>>>>>>> --
>>>>>>> Jianshi Huang
>>>>>>>
>>>>>>> LinkedIn: jianshi
>>>>>>> Twitter: @jshuang
>>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>

Mime
View raw message