spark-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
Date Wed, 20 Nov 2019 19:01:43 GMT
Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
same old and buggy that's been there a while. "Stable" in that sense.
I'm sure there is a lot more delta between Hive 1 and 2 in terms of
bug fixes that are important; the question isn't just 1.x releases.

What I don't know is how much of that affects Spark, as it's mostly a
Hive client. Clearly some of it does.

I'd prefer making Hive 2.x the default in the POM for 3.0. Mostly on the
grounds that its effects are on deployed clusters, not apps. And
deployers can still choose a binary distro with 1.x or make the choice
they want. Those that don't care should probably be nudged to 2.x.
Spark 3.x is already full of behavior changes and 'unstable', so I
think this is minor relative to the overall risk question.

On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>
> Hi, All.
>
> I'm sending this email because it's important to discuss this topic narrowly
> and reach a clear conclusion.
>
> `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
> stabler than XXX, please give us the evidence. Then, we can fix it.
> Otherwise, let's stop treating `The forked Hive 1.2.1` as invincible.
>
> Historically, the following forked Hive 1.2.1 has never been stable.
> It's just frozen. Since the forked Hive is out of our control, we ignored bugs.
> That's all. The reality is far from stable.
>
>     https://mvnrepository.com/artifact/org.spark-project.hive/
>     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark (2015 August)
>     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2 (2016 April)
>
> First, let's begin with Hive itself by comparing against Apache Hive 1.2.2 and 1.2.3.
>
>     Apache Hive 1.2.2 has 50 bug fixes.
>     Apache Hive 1.2.3 has 9 bug fixes.
>
> I will not cover all of them, but the Apache Hive community also backports
> important patches, as the Apache Spark community does.
>
> Second, let's move to SPARK issues because we aren't exposed to all Hive issues.
>
>     SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
>     SPARK-22267 Spark SQL incorrectly reads ORC file when column order is different
>
> These have been reported since Apache Spark 1.6.x because the forked Hive doesn't have
> a proper upstream patch like HIVE-11592 (fixed in Apache Hive 1.3.0).
>
> Since we couldn't update the frozen forked Hive, we added the Apache ORC dependency
> in SPARK-20682 (2.3.0), added a switching configuration in SPARK-20728 (2.3.0), and
> turned on `spark.sql.hive.convertMetastoreOrc` by default in SPARK-22279 (2.4.0).
> However, if you turn off the switch and start to use the forked Hive,
> you will be exposed to the buggy forked Hive 1.2.1 again.
>
> Third, let's talk about new features like Hadoop 3 and JDK11.
> No one believes that the ancient forked Hive 1.2.1 will work with these.
> I saw that the following issue was mentioned as evidence of a Hive 2.3.6 bug.
>
>     SPARK-29245 ClassCastException during creating HiveMetaStoreClient
>
> Yes. I know that issue because I reported it and verified HIVE-21508.
> It's fixed already and will be released in Apache Hive 2.3.7.
>
> Can we imagine something like this in the forked Hive 1.2.1?
> 'No'. There is no future for it. It's frozen.
>
> From now on, I want to claim that the forked Hive 1.2.1 is the unstable one.
> I welcome all your positive and negative opinions.
> Please share your concerns and problems so that we can fix them together.
> Apache Spark is an open source project we share.
>
> Bests,
> Dongjoon.
>
