spark-issues mailing list archives

From "Yi Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir
Date Thu, 26 May 2016 12:09:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301988#comment-15301988 ]

Yi Zhou commented on SPARK-15034:
---------------------------------

Hi [~yhuai]
I was able to integrate Spark SQL with the Hive metastore in Spark 1.6, but I am confused
about how the integration is supposed to work in Spark 2.0. Could you point me to the correct
steps or required configuration for 2.0? My situation: I use Spark 2.0 to connect to an existing
Hive metastore, but the existing database never shows up in the spark-sql CLI, even though I
can see it from the Hive CLI. Please see my experiment below.
Thanks in advance!

Package build command:
{code}
./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.1 -Phive
-Phive-thriftserver -DskipTests
{code}
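
One quick way to sanity-check the build itself: if the -Phive profile took effect, a spark-shell
started from the resulting distribution should report the Hive catalog implementation. A minimal
sketch:
{code}
// In spark-shell from the built distribution. The repl sets
// spark.sql.catalogImplementation to "hive" when Hive classes are present;
// "in-memory" here would mean the build is missing Hive support.
spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory")
{code}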

Key entries in spark-defaults.conf:
{code}
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=/usr/lib/spark/jars/*
spark.sql.warehouse.dir=/user/hive/warehouse
{code}
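
For reference, the same warehouse setting can also be passed programmatically when the session
is built. A minimal Scala sketch, assuming a -Phive build; the app name is just a placeholder:
{code}
import org.apache.spark.sql.SparkSession

// Minimal Spark 2.0 session sketch; the warehouse path mirrors the
// spark-defaults.conf entry above.
val spark = SparkSession.builder()
  .appName("warehouse-check")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}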

{code}
/usr/lib/spark/bin/spark-sql
spark-sql> show databases;
16/05/26 20:06:04 INFO execution.SparkSqlParser: Parsing command: show databases
16/05/26 20:06:04 INFO log.PerfLogger: <PERFLOG method=create_database from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
16/05/26 20:06:04 INFO metastore.HiveMetaStore: 0: create_database: Database(name:default,
description:default database, locationUri:hdfs://hw-node2:8020/user/hive/warehouse, parameters:{})
16/05/26 20:06:04 INFO HiveMetaStore.audit: ugi=root    ip=unknown-ip-addr      cmd=create_database:
Database(name:default, description:default database, locationUri:hdfs://hw-node2:8020/user/hive/warehouse,
parameters:{})
16/05/26 20:06:04 ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database
default already exists)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:944)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:138)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
        at com.sun.proxy.$Proxy34.create_database(Unknown Source)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:646)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:105)
        at com.sun.proxy.$Proxy35.createDatabase(Unknown Source)
        at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:345)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:289)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260)
        at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207)
        at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206)
        at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249)
        at org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:288)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:94)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:68)
        at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:93)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:142)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:84)
        at org.apache.spark.sql.hive.HiveSessionCatalog.<init>(HiveSessionCatalog.scala:50)
        at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:49)
        at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
        at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
        at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
        at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:532)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:652)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:323)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:239)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

16/05/26 20:06:04 INFO log.PerfLogger: </PERFLOG method=create_database start=1464264364384
end=1464264364389 duration=5 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0
retryCount=-1 error=true>
16/05/26 20:06:04 INFO log.PerfLogger: <PERFLOG method=get_databases from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
16/05/26 20:06:04 INFO metastore.HiveMetaStore: 0: get_databases: *
16/05/26 20:06:04 INFO HiveMetaStore.audit: ugi=root    ip=unknown-ip-addr      cmd=get_databases:
*
16/05/26 20:06:04 INFO log.PerfLogger: </PERFLOG method=get_databases start=1464264364685
end=1464264364693 duration=8 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0
retryCount=0 error=false>
16/05/26 20:06:04 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
16/05/26 20:06:04 INFO scheduler.DAGScheduler: Got job 0 (processCmd at CliDriver.java:376)
with 1 output partitions
16/05/26 20:06:04 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (processCmd at CliDriver.java:376)
16/05/26 20:06:04 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/05/26 20:06:04 INFO scheduler.DAGScheduler: Missing parents: List()
16/05/26 20:06:04 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2]
at processCmd at CliDriver.java:376), which has no missing parents
16/05/26 20:06:05 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated
size 3.9 KB, free 511.1 MB)
16/05/26 20:06:05 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory
(estimated size 2.3 KB, free 511.1 MB)
16/05/26 20:06:05 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.3.11:34060
(size: 2.3 KB, free: 511.1 MB)
16/05/26 20:06:05 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1012
16/05/26 20:06:05 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage
0 (MapPartitionsRDD[2] at processCmd at CliDriver.java:376)
16/05/26 20:06:05 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
16/05/26 20:06:06 INFO spark.ExecutorAllocationManager: Requesting 1 new executor because
tasks are backlogged (new desired total will be 1)
16/05/26 20:06:10 INFO cluster.YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null)
(192.168.3.15:59021) with ID 1
16/05/26 20:06:10 INFO spark.ExecutorAllocationManager: New executor 1 has registered (new
total is 1)
16/05/26 20:06:10 INFO storage.BlockManagerMasterEndpoint: Registering block manager hw-node5:34475
with 511.1 MB RAM, BlockManagerId(1, hw-node5, 34475)
16/05/26 20:06:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.3.15,
partition 0, PROCESS_LOCAL, 5507 bytes)
16/05/26 20:06:10 INFO cluster.YarnClientSchedulerBackend: Launching task 0 on executor id:
1 hostname: 192.168.3.15.
16/05/26 20:06:10 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hw-node5:34475
(size: 2.3 KB, free: 511.1 MB)
16/05/26 20:06:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in
2694 ms on 192.168.3.15 (1/1)
16/05/26 20:06:12 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed,
from pool
16/05/26 20:06:12 INFO scheduler.DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376)
finished in 7.774 s
16/05/26 20:06:12 INFO scheduler.DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376,
took 7.974825 s
default
Time taken: 8.81 seconds, Fetched 1 row(s)
16/05/26 20:06:12 INFO CliDriver: Time taken: 8.81 seconds, Fetched 1 row(s)
{code}
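
Note what the trace shows: the in-process HMSHandler first tries to create the default database
at the configured warehouse location, then get_databases returns only "default". That pattern
suggests spark-sql is talking to a fresh local metastore rather than the existing one the Hive
CLI sees. A rough way to test that theory, assuming hive.metastore.uris is supposed to come
from /etc/hive/conf/hive-site.xml (my assumption about this cluster):
{code}
// Diagnostic sketch, run from a spark-shell started with the same
// spark-defaults.conf. Configuration.get returns null for an unset key;
// null here would mean hive-site.xml was never merged into the Hadoop
// configuration, so the Hive client falls back to an embedded metastore,
// which would explain seeing only the freshly created "default" database.
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))
{code}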

> Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15034
>                 URL: https://issues.apache.org/jira/browse/SPARK-15034
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>              Labels: release_notes, releasenotes
>             Fix For: 2.0.0
>
>
> Starting from Spark 2.0, spark.sql.warehouse.dir will be the conf to set warehouse location.
We will not use hive.metastore.warehouse.dir.




