spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
Date Thu, 21 Nov 2019 00:10:10 GMT
Oh, actually, in order to decouple Hadoop 3.2 and Hive 2.3 upgrades, we
will need a hive-2.3 profile anyway, whether or not we have the hive-1.2
profile.

On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian <lian.cs.zju@gmail.com> wrote:

> Just to summarize my points:
>
>    1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but it is
>    optional. End-users may choose between Hive 1.2/2.3 via a new profile
>    (either adding a hive-1.2 profile or adding a hive-2.3 profile works for
>    me, depending on which Hive version we pick as the default version).
>    2. Decouple Hive version upgrade and Hadoop version upgrade, so that
>    people have more choices in production and Spark 3.0 migration is
>    easier (e.g., you don't have to switch to Hadoop 3 in order to pick Hive
>    2.3 and/or JDK 11).
>    3. For default Hadoop/Hive versions in Spark 3.0, I personally do not
>    have a preference as long as the above two are met.
>
>
> On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>> Dongjoon, I don't think we have any conflicts here. As stated in other
>> threads multiple times, as long as Hive 2.3 and Hadoop 3.2 version upgrades
>> can be decoupled, I have no preference as to which Hive/Hadoop version
>> becomes the default. So the following two plans both work for me:
>>
>>    1. Keep Hive 1.2 as default Spark 3.0 execution Hive version, and
>>    have an extra hive-2.3 profile.
>>    2. Choose Hive 2.3 as default Spark 3.0 execution Hive version, and
>>    have an extra hive-1.2 profile.
>>
>> BTW, I was also discussing Hive dependency issues with other people
>> offline, and I realized that the Hive isolated client loader is not well
>> known, and caused unnecessary confusion/worry. So I would like to provide
>> some background context for readers who are not familiar with Spark Hive
>> integration here. *Building Spark 3.0 with Hive 1.2.1 does NOT mean that
>> you can only interact with Hive 1.2.1.*
>>
>> Spark does work with different versions of Hive metastore via an isolated
>> classloading mechanism. *Even if Spark itself is built with the Hive
>> 1.2.1 fork, you can still interact with a Hive 2.3 metastore, and this has
>> been true ever since Spark 1.x.* In order to do this, just set the
>> following two options according to instructions in our official doc page
>> <http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore>
>> :
>>
>>    - spark.sql.hive.metastore.version
>>    - spark.sql.hive.metastore.jars
>>
>> Say you set "spark.sql.hive.metastore.version" to "2.3.6", and
>> "spark.sql.hive.metastore.jars" to "maven", Spark will pull Hive 2.3.6
>> dependencies from Maven at runtime when initializing the Hive metastore
>> client. And those dependencies will NOT conflict with the built-in Hive
>> 1.2.1 jars, because the downloaded jars are loaded using an isolated
>> classloader (see here
>> <https://github.com/apache/spark/blob/1febd373ea806326d269a60048ee52543a76c918/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala>).
>> Historically, we call these two sets of Hive dependencies "execution Hive"
>> and "metastore Hive". The former is mostly used for features like SerDe,
>> while the latter is used to interact with Hive metastore. And the Hive
>> version upgrade we are discussing here is about the execution Hive.
>>
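As a minimal sketch of the two settings described above: the snippet below only assembles the option names and values from this message ("2.3.6" and "maven") into `--conf key=value` arguments of the kind `spark-submit` accepts. The helper names `hive_metastore_conf` and `to_submit_args` are hypothetical and exist only for this illustration; nothing here starts Spark or contacts a metastore.

```python
def hive_metastore_conf(version, jars="maven"):
    """Return the two metastore settings named above as a plain dict.

    Hypothetical helper for illustration; only the two option keys
    come from the message itself.
    """
    return {
        "spark.sql.hive.metastore.version": version,
        "spark.sql.hive.metastore.jars": jars,
    }


def to_submit_args(conf):
    """Render a config dict as `--conf key=value` command-line arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # Print the arguments for the "2.3.6" / "maven" example from the text.
    print(" ".join(to_submit_args(hive_metastore_conf("2.3.6"))))
```

The same two key/value pairs could presumably also be set in spark-defaults.conf or passed when building the session, as long as they are in place before the Hive metastore client is initialized.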
>> Cheng
>>
>> On Wed, Nov 20, 2019 at 2:38 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>> wrote:
>>
>>> Nice. That's progress.
>>>
>>> Let's narrow down the path. We need to clarify which criteria we can
>>> agree on.
>>>
>>> 1. What does `battle-tested for years` mean exactly?
>>>     How and when can we start the `battle-tested` stage for Hive 2.3?
>>>
>>> 2. What is the new "Hive integration in Spark"?
>>>     While introducing Hive 2.3, we fixed the compatibility issues, as you
>>>     said. Most of the code is shared between Hive 1.2 and Hive 2.3.
>>>     That means if there is a bug inside this shared code, both of them
>>>     will be affected.
>>>     Of course, we can fix this because it's Spark code. We will learn
>>>     and fix it as you said.
>>>
>>>     > Yes, there are issues, but people have learned how to get along
>>>     > with these issues.
>>>
>>>     The only non-shared code is the following.
>>>     Do you have concerns about the following directories?
>>>     If there are no bugs in the following codebase, can we switch?
>>>
>>>     $ find . -name v2.3.5
>>>     ./sql/core/v2.3.5
>>>     ./sql/hive-thriftserver/v2.3.5
>>>
>>> 3. We know that we can keep both code bases, but the community should
>>>     choose Hive 2.3 officially.
>>>     That's the right choice from the Apache project policy perspective.
>>>     At least, Sean and I prefer that.
>>>     If someone really wants to stick to the Hive 1.2 fork, they can use
>>>     it at their own risk.
>>>
>>>     > for Spark 3.0 end-users who really don't want to interact with
>>>     > this Hive 1.2 fork, they can always use Hive 2.3 at their own risk.
>>>
>>> Specifically, what about having a `hive-1.2` profile at `3.0.0`, with
>>> Hive 2.3 as the default in the pom at least?
>>> What do you think about that approach, Cheng?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Nov 20, 2019 at 12:59 PM Cheng Lian <lian.cs.zju@gmail.com>
>>> wrote:
>>>
>>>> Hey Dongjoon and Felix,
>>>>
>>>> I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise,
>>>> we wouldn't even consider integrating with Hive 2.3 in Spark 3.0.
>>>>
>>>> However, *"Hive" and "Hive integration in Spark" are two quite
>>>> different things*, and I don't think anybody has ever mentioned "the
>>>> forked Hive 1.2.1 is stable" in any recent Hadoop/Hive version discussions
>>>> (at least I double-checked all my replies).
>>>>
>>>> What I really care about is the stability and quality of "Hive
>>>> integration in Spark", which has gone through some major updates due to
>>>> the recent Hive 2.3 upgrade in Spark 3.0. We have already found bugs in
>>>> this piece, and empirically, for a significant upgrade like this one, it
>>>> is not surprising that other bugs/regressions can be found in the near
>>>> future. On the other hand, the Hive 1.2 integration code path in Spark
>>>> has been battle-tested for years. Yes, there are issues, but people have
>>>> learned how to get along with these issues. And please don't forget that
>>>> Spark 3.0 end-users who really don't want to interact with this Hive 1.2
>>>> fork can always use Hive 2.3 at their own risk.
>>>>
>>>> True, "stable" is quite a vague criterion, and hard to prove. But
>>>> that is exactly the reason why we may want to be conservative and wait for
>>>> some time and see whether there are further signals suggesting that the
>>>> Hive 2.3 integration in Spark 3.0 is *unstable*. After one or two
>>>> Spark 3.x minor releases, if we've fixed all the outstanding issues and no
>>>> more significant ones are showing up, we can declare that the Hive 2.3
>>>> integration in Spark 3.x is stable, and then we can consider removing
>>>> reference to the Hive 1.2 fork. Does that make sense?
>>>>
>>>> Cheng
>>>>
>>>> On Wed, Nov 20, 2019 at 11:49 AM Felix Cheung <
>>>> felixcheung_m@hotmail.com> wrote:
>>>>
>>>>> Just to add - hive 1.2 fork is definitely not more stable. We know of
>>>>> a few critical bug fixes that we cherry picked into a fork of that fork
>>>>> to maintain ourselves.
>>>>>
>>>>>
>>>>> ------------------------------
>>>>> *From:* Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>>>> *Sent:* Wednesday, November 20, 2019 11:07:47 AM
>>>>> *To:* Sean Owen <srowen@gmail.com>
>>>>> *Cc:* dev <dev@spark.apache.org>
>>>>> *Subject:* Re: The Myth: the forked Hive 1.2.1 is stabler than XXX
>>>>>
>>>>> Thanks. That will be a giant step forward, Sean!
>>>>>
>>>>> > I'd prefer making it the default in the POM for 3.0.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Wed, Nov 20, 2019 at 11:02 AM Sean Owen <srowen@gmail.com> wrote:
>>>>>
>>>>> Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
>>>>> same old and buggy that's been there a while. "stable" in that sense.
>>>>> I'm sure there is a lot more delta between Hive 1 and 2 in terms of
>>>>> bug fixes that are important; the question isn't just 1.x releases.
>>>>>
>>>>> What I don't know is how much affects Spark, as it's a Hive client
>>>>> mostly. Clearly some do.
>>>>>
>>>>> I'd prefer making it the default in the POM for 3.0. Mostly on the
>>>>> grounds that its effects are on deployed clusters, not apps. And
>>>>> deployers can still choose a binary distro with 1.x or make the choice
>>>>> they want. Those that don't care should probably be nudged to 2.x.
>>>>> Spark 3.x is already full of behavior changes and 'unstable', so I
>>>>> think this is minor relative to the overall risk question.
>>>>>
>>>>> On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun <
>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > I'm sending this email because it's important to discuss this topic
>>>>> > narrowly and reach a clear conclusion.
>>>>> >
>>>>> > `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
>>>>> > by ignoring the existing bugs. If you want to say the forked Hive 1.2.1
>>>>> > is stabler than XXX, please give us the evidence. Then, we can fix it.
>>>>> > Otherwise, let's stop treating `The forked Hive 1.2.1` as invincible.
>>>>> >
>>>>> > Historically, the following forked Hive 1.2.1 has never been stable.
>>>>> > It's just frozen. Since the forked Hive is out of our control, we
>>>>> > ignored bugs. That's all. The reality is far from a stable status.
>>>>> >
>>>>> >     https://mvnrepository.com/artifact/org.spark-project.hive/
>>>>> >     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark (2015 August)
>>>>> >     https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2 (2016 April)
>>>>> >
>>>>> > First, let's begin with Hive itself by comparing against Apache Hive
>>>>> > 1.2.2 and 1.2.3.
>>>>> >
>>>>> >     Apache Hive 1.2.2 has 50 bug fixes.
>>>>> >     Apache Hive 1.2.3 has 9 bug fixes.
>>>>> >
>>>>> > I will not cover all of them, but the Apache Hive community also
>>>>> > backports important patches, as the Apache Spark community does.
>>>>> >
>>>>> > Second, let's move to SPARK issues because we aren't exposed to all
>>>>> > Hive issues.
>>>>> >
>>>>> >     SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
>>>>> >     SPARK-22267 Spark SQL incorrectly reads ORC file when column order is different
>>>>> >
>>>>> > These have been reported since Apache Spark 1.6.x because the forked
>>>>> > Hive lacks proper upstream patches such as HIVE-11592 (fixed in Apache
>>>>> > Hive 1.3.0).
>>>>> >
>>>>> > Since we couldn't update the frozen forked Hive, we added the Apache
>>>>> > ORC dependency in SPARK-20682 (2.3.0), added a switching configuration
>>>>> > in SPARK-20728 (2.3.0), and turned on
>>>>> > `spark.sql.hive.convertMetastoreOrc` by default in SPARK-22279 (2.4.0).
>>>>> > However, if you turn off the switch and start to use the forked Hive,
>>>>> > you will be exposed to the buggy forked Hive 1.2.1 again.
>>>>> >
>>>>> > Third, let's talk about new features like Hadoop 3 and JDK11.
>>>>> > No one believes that the ancient forked Hive 1.2.1 will work with
>>>>> > these. I saw that the following issue is mentioned as evidence of a
>>>>> > Hive 2.3.6 bug.
>>>>> >
>>>>> >     SPARK-29245 ClassCastException during creating HiveMetaStoreClient
>>>>> >
>>>>> > Yes. I know that issue because I reported it and verified HIVE-21508.
>>>>> > It's fixed already and will be released in Apache Hive 2.3.7.
>>>>> >
>>>>> > Can we imagine something like this in the forked Hive 1.2.1?
>>>>> > 'No'. There is no future for it. It's frozen.
>>>>> >
>>>>> > From now on, I want to claim that the forked Hive 1.2.1 is the
>>>>> > unstable one.
>>>>> > I welcome all your positive and negative opinions.
>>>>> > Please share your concerns and problems, and let's fix them together.
>>>>> > Apache Spark is an open source project we share.
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>> >
>>>>>
>>>>>
