spark-dev mailing list archives

From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
Date Tue, 15 Jan 2019 17:55:18 GMT
Does resolving https://issues.apache.org/jira/browse/HIVE-16391 mean keeping Spark on Hive 1.2?

I’m not sure that reduces the dependency on Hive: Hive is still there, and it’s a very
old Hive. IMO the longer we stay on it, the more the risk increases. (And it’s been years.)

Looking at the two PRs, they don’t seem very drastic to me, except for the thrift server. Is
there another, better approach to the thrift server?


________________________________
From: Xiao Li <gatorsmile@gmail.com>
Sent: Tuesday, January 15, 2019 9:44 AM
To: Yuming Wang
Cc: dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Hi, Yuming,

Thank you for your contributions! The community aims to reduce its dependence on Hive. Currently,
most Spark users are not using Hive. The changes look risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang <wgyumg@gmail.com> wrote on Tuesday, January 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2<https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2>
to 2.3.4<https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to resolve some
critical issues, such as supporting Hadoop 3.x and fixing some ORC and Parquet issues. This is the
list:
Hive issues:
[SPARK-26332<https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790] Spark sql
write orc table on viewFS throws exception
[SPARK-25193<https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505] insert
overwrite doesn't throw exception when drop old data fails
[SPARK-26437<https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083] Decimal
data becomes bigint to query, unable to query
[SPARK-25919<https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771] Date value
corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
[SPARK-12014<https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100] Spark SQL
query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534<https://issues.apache.org/jira/browse/SPARK-23534>] Spark run on Hadoop
3.0.0
[SPARK-20202<https://issues.apache.org/jira/browse/SPARK-20202>] Remove references to
org.spark-project.hive
[SPARK-18673<https://issues.apache.org/jira/browse/SPARK-18673>] Dataframes doesn't
work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766<https://issues.apache.org/jira/browse/SPARK-24766>] CreateHiveTableAsSelect
and InsertIntoHiveDir won't generate decimal column stats in parquet


Because the hive-thriftserver module changes substantially in this upgrade, I split the work
into two PRs for easier review.
The first PR<https://github.com/apache/spark/pull/23552> does not contain the changes
to hive-thriftserver. Please ignore the failed test in hive-thriftserver.
The second PR<https://github.com/apache/spark/pull/23553> contains the complete changes.

I have created a Spark distribution for Apache Hadoop 2.7; you can download it via Google
Drive<https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or Baidu Pan<https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
Please help review and test. Thanks.
