spark-user mailing list archives

From Andrew Lee <alee...@hotmail.com>
Subject RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database
Date Tue, 03 Feb 2015 15:55:03 GMT
Hi All,
In Spark 1.2.0-rc1, I have tried setting hive.metastore.warehouse.dir to share the Hive warehouse location on HDFS; however, it does NOT work in yarn-cluster mode. In the NameNode audit log, I see that Spark is trying to access the default Hive warehouse location, /user/hive/warehouse/spark_hive_test_yarn_cluster_table, as opposed to /hive/spark_hive_test_yarn_cluster_table.
A tweaked code snippet from the example is shown below; it was compiled, built, and submitted in yarn-cluster mode. (It works in yarn-client mode, since the driver machine can find hive-site.xml. But we don't deploy hive-site.xml to all data nodes; that is not a standard deployment. Instead, the file should be shipped via --jars or --files, but it still fails when I do so.)
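For reference, shipping the conf file with spark-submit looks roughly like this (the jar path is a placeholder; the class is the one in the snippet below):

  spark-submit --master yarn-cluster \
    --class SparkSQLTestCase2HiveContextYarnClusterApp \
    --files /etc/hive/conf/hive-site.xml \
    path/to/app-assembly.jar
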
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext

object SparkSQLTestCase2HiveContextYarnClusterApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark SQL Hive Context TestCase Application")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    import hiveContext._

    // Set the hive warehouse dir so that it aligns with /etc/hive/conf/hive-site.xml
    hiveContext.hql("SET hive.metastore.warehouse.dir=hdfs://hive")

    // Create the table if it does not exist yet
    hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_yarn_cluster_table (key INT, value STRING)")

    // Load sample data from HDFS; it needs to be uploaded first
    hiveContext.hql("LOAD DATA INPATH 'spark/test/resources/kv1.txt' INTO TABLE spark_hive_test_yarn_cluster_table")

    // Queries are expressed in HiveQL. collect() pulls all results into driver memory,
    // so be careful: this is just a test case. Do NOT do this in production; store results to HDFS instead.
    hiveContext.hql("FROM spark_hive_test_yarn_cluster_table SELECT key, value").collect().foreach(println)
  }
}

From: huaiyin.thu@gmail.com
Date: Wed, 13 Aug 2014 16:56:13 -0400
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database
To: linlin200605@gmail.com
CC: lian.cs.zju@gmail.com; user@spark.apache.org

I think the problem is that when you are using yarn-cluster mode, the Spark driver runs inside the application master, so the Hive conf is not accessible to the driver. Can you try setting those confs by using hiveContext.set(...)? Or, maybe you can copy hive-site.xml to spark/conf on the node running the application master.
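For example, a rough sketch of that idea using SET statements, the same mechanism as the warehouse-dir example above (the thrift host and port below are placeholders; use whatever your hive-site.xml points at):

  // Hand the metastore location to the driver explicitly, since the AM node
  // may not have hive-site.xml on its classpath. Placeholder values below.
  hiveContext.hql("SET hive.metastore.uris=thrift://your-metastore-host:9083")
  hiveContext.hql("SET hive.metastore.warehouse.dir=/biginsights/hive/warehouse")

Setting these before running any query is safest; whether Hive still picks them up after the first metastore call has been made is another question.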



On Tue, Aug 12, 2014 at 8:38 PM, Jenny Zhao <linlin200605@gmail.com> wrote:



Hi Yin,

hive-site.xml was copied to spark/conf and is the same as the one under $HIVE_HOME/conf.

Through the Hive CLI, I don't see any problem, but with Spark in yarn-cluster mode I am not able to switch to a database other than the default one; in yarn-client mode, it works fine.



Thanks!

Jenny




On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai <huaiyin.thu@gmail.com> wrote:

Hi Jenny,
Have you copied hive-site.xml to the spark/conf directory? If not, can you put it in conf/ and try again?

Thanks,
Yin

On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao <linlin200605@gmail.com> wrote:

Thanks Yin! 

Here is my hive-site.xml, which I copied from $HIVE_HOME/conf; I didn't experience any problem connecting to the metastore through Hive. It uses DB2 as the metastore database.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
 <property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
 </property>
 <property>
  <name>hive.querylog.location</name>
  <value>/var/ibm/biginsights/hive/query/${user.name}</value>
 </property>
 <property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/biginsights/hive/warehouse</value>
 </property>
 <property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.12.0.war</value>
 </property>
 <property>
  <name>hive.metastore.metrics.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:db2://hdtest022.svl.ibm.com:50001/BIDB</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.ibm.db2.jcc.DB2Driver</value>
 </property>
 <property>
  <name>hive.stats.autogather</name>
  <value>false</value>
 </property>
 <property>
  <name>javax.jdo.mapping.Schema</name>
  <value>HIVE</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>catalog</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>V2pJNWMxbFlVbWhaZHowOQ==</value>
 </property>
 <property>
  <name>hive.metastore.password.encrypt</name>
  <value>true</value>
 </property>
 <property>
  <name>org.jpox.autoCreateSchema</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.server2.thrift.min.worker.threads</name>
  <value>5</value>
 </property>
 <property>
  <name>hive.server2.thrift.max.worker.threads</name>
  <value>100</value>
 </property>
 <property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
 </property>
 <property>
  <name>hive.server2.thrift.bind.host</name>
  <value>hdtest022.svl.ibm.com</value>
 </property>
 <property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
 </property>
 <property>
  <name>hive.server2.custom.authentication.class</name>
  <value>org.apache.hive.service.auth.WebConsoleAuthenticationProviderImpl</value>
 </property>
 <property>
  <name>hive.server2.enable.impersonation</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.security.webconsole.url</name>
  <value>http://hdtest022.svl.ibm.com:8080</value>
 </property>
 <property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
 </property>
</configuration>

On Mon, Aug 11, 2014 at 4:29 PM, Yin Huai <huaiyin.thu@gmail.com> wrote:

Hi Jenny,

How's your metastore configured for both Hive and Spark SQL? Which metastore mode are you
using (based on https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin)?
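(For reference, a remote metastore setup would carry something like the following in hive-site.xml, as opposed to a direct javax.jdo ConnectionURL; host and port here are illustrative:)

 <property>
  <name>hive.metastore.uris</name>
  <value>thrift://your-metastore-host:9083</value>
 </property>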

Thanks,

Yin


On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao <linlin200605@gmail.com> wrote:

You can reproduce this issue with the following steps (assuming you have a YARN cluster + Hive 12):

1) Using the hive shell, create a database, e.g.: create database ttt

2) Write a simple Spark SQL program:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext

object HiveSpark {
  case class Record(key: Int, value: String)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HiveSpark")
    val sc = new SparkContext(sparkConf)

    // A hive context creates an instance of the Hive metastore in process
    val hiveContext = new HiveContext(sc)
    import hiveContext._

    hql("use ttt")
    hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hql("LOAD DATA INPATH '/user/biadmin/kv1.txt' INTO TABLE src")

    // Queries are expressed in HiveQL
    println("Result of 'SELECT *': ")
    hql("SELECT * FROM src").collect.foreach(println)

    sc.stop()
  }
}
3) Run it in yarn-cluster mode.
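A quick way to check which metastore the driver actually connected to is to list databases before switching (a sketch; add it before the use statement):

    // If 'ttt' is missing from this output in yarn-cluster mode, the driver is
    // talking to a different metastore (e.g. a fresh local one) than the Hive CLI.
    hql("SHOW DATABASES").collect.foreach(println)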


On Mon, Aug 11, 2014 at 9:44 AM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

Since you were using hql(...), it's probably not related to the JDBC driver. But I failed to reproduce this issue locally with a single-node pseudo-distributed YARN cluster. Would you mind elaborating on the steps to reproduce this bug? Thanks

On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

Hi Jenny, does this issue only happen when running Spark SQL with YARN in your environment?



On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao <linlin200605@gmail.com> wrote:

Hi,

I am able to run my hql query in yarn-cluster mode when connecting to the default Hive metastore defined in hive-site.xml.

However, if I want to switch to a different database, like:

  hql("use other-database")

it only works in yarn-client mode, but fails in yarn-cluster mode with the following stack trace:


14/08/08 12:09:11 INFO HiveMetaStore: 0: get_database: tt
14/08/08 12:09:11 INFO audit: ugi=biadmin	ip=unknown-ip-addr	cmd=get_database: tt	
14/08/08 12:09:11 ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database
named tt)
	at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
	at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
	at java.lang.reflect.Method.invoke(Method.java:611)
	at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
	at $Proxy15.getDatabase(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
	at java.lang.reflect.Method.invoke(Method.java:611)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
	at $Proxy17.get_database(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
	at java.lang.reflect.Method.invoke(Method.java:611)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
	at $Proxy18.getDatabase(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
	at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
	at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
	at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
	at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:208)
	at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:182)
	at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:272)
	at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:269)
	at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:86)
	at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:91)
	at org.apache.spark.examples.sql.hive.HiveSpark$.main(HiveSpark.scala:35)
	at org.apache.spark.examples.sql.hive.HiveSpark.main(HiveSpark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
	at java.lang.reflect.Method.invoke(Method.java:611)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)

14/08/08 12:09:11 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database
does not exist: tt
	at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
	at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
	at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:208)
	at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:182)
	at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:272)
	at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:269)
	at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:86)
	at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:91)
	at org.apache.spark.examples.sql.hive.HiveSpark$.main(HiveSpark.scala:35)
	at org.apache.spark.examples.sql.hive.HiveSpark.main(HiveSpark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
	at java.lang.reflect.Method.invoke(Method.java:611)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
Why is that? Not sure if this has something to do with the Hive JDBC driver?

Thank you!


Jenny