spark-user mailing list archives

From patcharee <Patcharee.Thong...@uni.no>
Subject Re: hiveContext.sql NullPointerException
Date Thu, 11 Jun 2015 21:01:15 GMT
Hi,

Does
df.write.partitionBy("partitions").format("format").mode("overwrite").saveAsTable("tbl")
support the ORC file format?

I tried
df.write.partitionBy("zone", "z", "year", "month").format("orc").mode("overwrite").saveAsTable("tbl"),
but after the insert the schema of my table "tbl" had changed to
something I did not expect:

-- FROM --
CREATE EXTERNAL TABLE `4dim`(`u` float, `v` float)
PARTITIONED BY (`zone` int, `z` int, `year` int, `month` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
   'orc.compress'='ZLIB',
   'transient_lastDdlTime'='1433016475')

-- TO --
CREATE TABLE `4dim`(`col` array<string> COMMENT 'from deserializer')
PARTITIONED BY (`zone` int COMMENT '', `z` int COMMENT '', `year` int COMMENT '', `month` int COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
TBLPROPERTIES (
   'EXTERNAL'='FALSE',
   'spark.sql.sources.provider'='orc',
   'spark.sql.sources.schema.numParts'='1',
   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"u\",\"type\":\"float\",\"nullable\":true,\"metadata\":{}},{\"name\":\"v\",\"type\":\"float\",\"nullable\":true,\"metadata\":{}},{\"name\":\"zone\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"z\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"year\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}',
   'transient_lastDdlTime'='1434055247')


I noticed there are files stored in HDFS as *.orc, but when I tried to
query the table from Hive I got nothing back. How can I fix this? Any
suggestions would be appreciated.
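
As far as I can tell, the real schema is still there, kept in the
spark.sql.sources.schema.* table properties above; Hive cannot interpret
those, which would explain why it only shows the placeholder column and
returns nothing. A minimal check (assuming the same HiveContext is still
available) is that Spark SQL itself can still read the table back with
the original schema:

     // Spark SQL restores the schema from the table properties, so this
     // should print u, v, zone, z, year, month even though Hive shows
     // only `col` array<string>.
     val restored = hiveContext.table("tbl")
     restored.printSchema()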

BR,
Patcharee


On 07. juni 2015 16:40, Cheng Lian wrote:
> Spark SQL supports Hive dynamic partitioning, so one possible 
> workaround is to create a Hive table partitioned by zone, z, year, and 
> month dynamically, and then insert the whole dataset into it directly.
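>
> A minimal sketch of that workaround (assuming the DataFrame is named df
> and a Hive ORC table `4dim` with matching columns already exists; the
> SET commands enable Hive dynamic partitioning):
>
>     df.registerTempTable("staging")
>     hiveContext.sql("SET hive.exec.dynamic.partition=true")
>     hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
>     hiveContext.sql(
>       "INSERT OVERWRITE TABLE `4dim` PARTITION (zone, z, year, month) " +
>       "SELECT u, v, zone, z, year, month FROM staging")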
>
> In 1.4, we also provide dynamic partitioning support for non-Hive 
> environments, so you can do something like this:
>
>     df.write.partitionBy("zone", "z", "year", 
> "month").format("parquet").mode("overwrite").saveAsTable("tbl")
>
> Cheng
>
> On 6/7/15 9:48 PM, patcharee wrote:
>> Hi,
>>
>> How can I work with HiveContext on the executors? If only the driver 
>> can see the HiveContext, does that mean I have to collect all 
>> datasets (very large) to the driver and use the HiveContext there? 
>> That would overload the driver's memory and fail.
>>
>> BR,
>> Patcharee
>>
>> On 07. juni 2015 11:51, Cheng Lian wrote:
>>> Hi,
>>>
>>> This is expected behavior. HiveContext.sql (and also 
>>> DataFrame.registerTempTable) is only expected to be invoked on the 
>>> driver side. However, the closure passed to RDD.foreach is executed 
>>> on the executor side, where no viable HiveContext instance exists.
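>>>
>>> To illustrate the distinction, a minimal sketch (rdd is hypothetical;
>>> only the driver holds a usable HiveContext):
>>>
>>>     hiveContext.sql("SELECT 1").show()  // driver side: works
>>>
>>>     rdd.foreach { x =>
>>>       // this closure is serialized and run on the executors, where
>>>       // hiveContext is null, hence the NullPointerException:
>>>       // hiveContext.sql("...")
>>>     }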
>>>
>>> Cheng
>>>
>>> On 6/7/15 10:06 AM, patcharee wrote:
>>>> Hi,
>>>>
>>>> I am trying to insert data into a partitioned Hive table. The 
>>>> groupByKey is there to combine the dataset into one partition of 
>>>> the Hive table. After the groupByKey, I converted the Iterable[X] 
>>>> to a DataFrame with X.toList.toDF(), but hiveContext.sql throws a 
>>>> NullPointerException, see below. Any suggestions? What could be 
>>>> wrong? Thanks!
>>>>
>>>> val varWHeightFlatRDD = varWHeightRDD
>>>>   .flatMap(FlatMapUtilClass().flatKeyFromWrf)
>>>>   .groupByKey()
>>>>   .foreach(x => {
>>>>     val zone  = x._1._1
>>>>     val z     = x._1._2
>>>>     val year  = x._1._3
>>>>     val month = x._1._4
>>>>     val df_table_4dim = x._2.toList.toDF()
>>>>     df_table_4dim.registerTempTable("table_4Dim")
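>>>>     // NOTE: hiveContext is only usable on the driver; this closure
>>>>     // runs on the executors, so the following call fails with the
>>>>     // NullPointerException shown below.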
>>>>     hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" +
>>>>       zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
>>>>       "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim")
>>>>   })
>>>>
>>>>
>>>> java.lang.NullPointerException
>>>>     at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
>>>>     at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
>>>>     at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)
>>>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>     at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
>>>>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
>>>>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>
>>>
>>
>>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

