spark-dev mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)
Date Sun, 07 Dec 2014 04:28:48 GMT
Ok, found another possible bug in Hive.

My current workaround is to use ALTER TABLE ... CHANGE to rename the columns.

The problem is that after the rename, the values of those columns all become
NULL.

Before renaming:
scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

Execute renaming:
scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
res13: org.apache.spark.sql.SchemaRDD =
SchemaRDD[972] at RDD at SchemaRDD.scala:108
== Query Plan ==
<Native command: executed by Hive>

After renaming:
scala> sql("select cre_ts from pmt limit 1").collect
res16: Array[org.apache.spark.sql.Row] = Array([null])

I created a JIRA for it:

  https://issues.apache.org/jira/browse/SPARK-4781
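
In the meantime, a possible workaround (an untested sketch, assuming a Spark
1.x HiveContext in the spark-shell) is to avoid ALTER TABLE entirely and
expose the renamed columns through a temp table built with aliases, so the
metastore column names keep matching the Parquet field names:

```scala
// Untested sketch: select with aliases instead of renaming in the metastore.
// The underlying `sorted::...` Parquet field names stay untouched, so the
// column-to-field mapping is never broken.
sql("select `sorted::id` as id, `sorted::cre_ts` as cre_ts from pmt")
  .registerTempTable("pmt_renamed")

sql("select cre_ts from pmt_renamed limit 1").collect
```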


Jianshi

On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <jianshi.huang@gmail.com>
wrote:

> Hmm... another issue I found with this approach: ANALYZE TABLE ...
> COMPUTE STATISTICS fails to attach the statistics metadata to the table, so
> later broadcast joins and the like will fail...
>
> Any idea how to fix this issue?
>
> Jianshi
>
> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
>
>> Very interesting: the line doing DROP TABLE throws an exception.
>> After removing it, everything works.
>>
>> Jianshi
>>
>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <jianshi.huang@gmail.com>
>> wrote:
>>
>>> Here's the solution I got after talking with Liancheng:
>>>
>>> 1) Use backquotes (`..`) to wrap all names containing illegal characters:
>>>
>>>     val rdd = parquetFile(file)
>>>     val schema = rdd.schema.fields.map(f => s"`${f.name}`
>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>
>>>     val ddl_13 = s"""
>>>       |CREATE EXTERNAL TABLE $name (
>>>       |  $schema
>>>       |)
>>>       |STORED AS PARQUET
>>>       |LOCATION '$file'
>>>       """.stripMargin
>>>
>>>     sql(ddl_13)
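
The schema-to-DDL step above can be sketched standalone (hypothetical field
names and a placeholder location, standing in for rdd.schema.fields, so no
SparkContext is needed):

```scala
// Standalone sketch of the backquoted-DDL construction above.
// The field names/types are hypothetical stand-ins for rdd.schema.fields.
case class Field(name: String, dataType: String)

val fields = Seq(Field("sorted::id", "bigint"),
                 Field("sorted::cre_ts", "string"))

// Backquote every name so Hive's parser accepts the "::" characters.
val schema = fields.map(f => s"`${f.name}` ${f.dataType}").mkString(",\n  ")

val ddl = s"""
  |CREATE EXTERNAL TABLE pmt (
  |  $schema
  |)
  |STORED AS PARQUET
  |LOCATION '/placeholder/path'
  """.stripMargin
```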
>>>
>>> 2) Create a new schema and call applySchema to generate a new SchemaRDD;
>>> I had to drop and re-register the table:
>>>
>>>     val t = table(name)
>>>     val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
>>> s.name.replaceAll(".*?::", ""))))
>>>     sql(s"drop table $name")
>>>     applySchema(t, newSchema).registerTempTable(name)
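
The prefix-stripping step in the copy above hinges on replaceAll with a
non-greedy pattern; a standalone illustration (hypothetical field names):

```scala
// ".*?::" matches the shortest run of characters up to each "::", so every
// "namespace::" qualifier Pig prepends is stripped from the field name.
val names = Seq("sorted::id", "sorted::cre_ts", "id")
val stripped = names.map(_.replaceAll(".*?::", ""))
// stripped: Seq("id", "cre_ts", "id")
```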
>>>
>>> I'm testing it for now.
>>>
>>> Thanks for the help!
>>>
>>>
>>> Jianshi
>>>
>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I had to use Pig for some preprocessing and to generate Parquet files
>>>> for Spark to consume.
>>>>
>>>> However, due to a Pig limitation, the generated schema embeds Pig's
>>>> namespace prefixes in the field names,
>>>>
>>>> e.g.
>>>> sorted::id, sorted::cre_ts, ...
>>>>
>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>
>>>>   create external table pmt (
>>>>     sorted::id bigint
>>>>   )
>>>>   stored as parquet
>>>>   location '...'
>>>>
>>>> Obviously that didn't work. I also tried removing the sorted:: identifier,
>>>> but the resulting rows contain only nulls.
>>>>
>>>> Any idea how to create a table in HiveContext from these Parquet files?
>>>>
>>>> Thanks,
>>>> Jianshi



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
