spark-user mailing list archives

From yansqrt3 <yanxudong...@gmail.com>
Subject Rename hive orc table caused no content in spark
Date Sun, 08 May 2016 04:08:08 GMT
Hi, 

I'm trying to rename an ORC table (doing it from Hive or from Spark makes no
difference). After the rename, all the content of the table becomes invisible
in Spark, while it is still available in Hive. The problem can always be
reproduced with a few very simple steps:

---------------------------- spark shell output ------------------------
scala> sql("select uid from pass_db_uc.uc_user limit 1")
res0: org.apache.spark.sql.DataFrame = [uid: bigint]

scala> .show  
+---+
|uid|
+---+
| 12|
+---+


scala> sql("select uid from pass_db_uc.uc_user limit 1").write.format("orc").saveAsTable("yytest.orc1")
16/05/08 11:10:07 WARN HiveMetaStore: Location:
hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc1
specified for non-external table:orc1

scala> sql("select * from yytest.orc1").count      <<<< content in table 
res3: Long = 1

scala> sql("alter table yytest.orc1 rename to yytest.orc2")
res4: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("select * from yytest.orc2").count      <<<< after renaming, no content in table
res5: Long = 0

scala> sql("alter table yytest.orc2 rename to yytest.orc1")
res6: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("select * from yytest.orc1").count     <<<< renaming it back recovers the content; I suspect a metadata error
res7: Long = 1

---------------------------- spark shell output end ------------------------

On the other hand, I tried the Hive shell to look for clues and found that
the content is available there, while the table schema contains a few weird
things. Hive is configured to use MR instead of Spark.

---------------------------- hive output -------------------------
hive> select count(*) from yytest.orc2;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. tez,
spark) or using Hive 1.X releases.
Query ID = root_20160508114940_c3c5454f-73a3-43fc-bdd9-f264414de88e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1461140858519_15796, Tracking URL =
http://prod-hadoop-master01:8088/proxy/application_1461140858519_15796/
Kill Command = /home/hadoop/hadoop-2.6.3/bin/hadoop job  -kill
job_1461140858519_15796
Hadoop job information for Stage-1: number of mappers: 1; number of
reducers: 1
2016-05-08 11:52:15,326 Stage-1 map = 0%,  reduce = 0%
2016-05-08 11:52:20,594 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
1.25 sec
2016-05-08 11:52:26,868 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1461140858519_15796
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.85 sec   HDFS Read:
8310 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 850 msec
OK
1      <---- Content in table
Time taken: 20.971 seconds, Fetched: 1 row(s)
hive>

hive> show create table yytest.orc2;
OK
CREATE TABLE `yytest.orc2`(
  `uid` bigint COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'path'='hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc1')
<--- serde properties do not match the new table name; correcting them manually
has no effect, there is still no content in Spark
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc2'  
<---- Checked in HDFS, the content is OK
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'EXTERNAL'='FALSE',
  'last_modified_by'='root',
  'last_modified_time'='1462679372',
  'numFiles'='1',
  'numRows'='-1',
  'rawDataSize'='-1',
  'spark.sql.sources.provider'='orc',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"uid\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}',
  'totalSize'='196',
  'transient_lastDdlTime'='1462679372')
Time taken: 1.258 seconds, Fetched: 25 row(s)

---------------------------- hive output end -------------------------
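
To make the mismatch in the output above easier to spot, here is a minimal sketch in plain Scala (no Spark or Hive needed) that pulls the 'path' serde property and the LOCATION out of "show create table" text and compares them. The names PathCheck, serdePath, location, mismatch and sample are made up for illustration, and the regexes only assume the quoting style seen in the output above. It just detects the discrepancy; it is not meant as a fix.

```scala
object PathCheck {
  // Matches 'path'='<value>' as printed inside WITH SERDEPROPERTIES (...).
  private val PathProp = """'path'\s*=\s*'([^']*)'""".r
  // Matches LOCATION followed by a quoted URI, possibly on the next line.
  private val Location = """LOCATION\s+'([^']*)'""".r

  // Value of the 'path' serde property, if present in the DDL text.
  def serdePath(ddl: String): Option[String] =
    PathProp.findFirstMatchIn(ddl).map(_.group(1))

  // Value of the table LOCATION, if present in the DDL text.
  def location(ddl: String): Option[String] =
    Location.findFirstMatchIn(ddl).map(_.group(1))

  // True only when both values are present and differ.
  def mismatch(ddl: String): Boolean =
    (serdePath(ddl), location(ddl)) match {
      case (Some(p), Some(l)) => p != l
      case _                  => false
    }

  // Trimmed copy of the relevant part of the Hive output above.
  val sample: String =
    """WITH SERDEPROPERTIES (
      |  'path'='hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc1')
      |LOCATION
      |  'hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc2'
      |""".stripMargin

  def main(args: Array[String]): Unit =
    println(mismatch(sample)) // prints "true" for the renamed table
}
```

For the renamed table this reports a mismatch: the serde property still points at orc1 while LOCATION points at orc2.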

Here are the versions I'm using:
Hadoop 2.6.3
Hive 2.0.0
Spark 1.6 - built with Hive support, depends on Scala 2.10

Does anyone have an idea why the content goes missing after the rename, or
any suggestions? Thanks a lot.

Regards,
Xudong




