spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)
Date Wed, 20 Nov 2019 00:51:56 GMT
Thanks for taking care of this, Dongjoon!

We can target SPARK-20202 to 3.1.0, but I don't think we should do it
immediately after cutting branch-3.0. The Hive 1.2 code paths can only
be removed once the Hive 2.3 code paths are proven to be stable. If they
turn out to be buggy in Spark 3.1, we may want to postpone SPARK-20202
further, to 3.2.0.

On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
wrote:

> Yes. It does. I meant SPARK-20202.
>
> Thanks. I understand that it can be treated like a Scala version issue.
> That's the reason why I framed this as a `policy` issue from the
> beginning.
>
> > First of all, I want to put this as a policy issue instead of a
> > technical issue.
>
> From a policy perspective, we should remove this immediately if we have a
> solution to fix this.
> For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to
> the current discussion status.
>
>     https://issues.apache.org/jira/browse/SPARK-20202
>
> And, if there are no other issues, I'll create a PR to remove it from
> the `master` branch when we cut `branch-3.0`.
>
> As for the additional `hadoop-2.7 with Hive 2.3` pre-built distribution,
> what do you think about this, Sean?
> The preparation has already started in another email thread, and I believe
> it is key to proving the stability of `Hive 2.3`
> (which Cheng, Hyukjin, and you asked for).
>
> Bests,
> Dongjoon.
>
>
> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>> It's kinda like a Scala version upgrade. Historically, we only remove
>> support for an older Scala version when the newer version is proven to be
>> stable after one or more Spark minor versions.
>>
>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <lian.cs.zju@gmail.com> wrote:
>>
>>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>>> forked Hive 1.2 dependencies completely, no? As long as we still keep
>>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>>> version. After all, for end-users and providers who need a particular
>>> version combination, they can always build Spark with proper profiles
>>> themselves.
>>>
>>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
>>> it's due to the folder name.
>>>
>>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>> wrote:
>>>
>>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>>
>>>> For the directory names, we use '1.2.1' and '2.3.5' because we simply
>>>> delayed renaming the directories until the 3.0.0 deadline to minimize
>>>> the diff.
>>>>
>>>> We can rename them right now if we want.
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, Cheng.
>>>>>
>>>>> This is unrelated to JDK11 and Hadoop 3; I'm talking about the JDK8 world.
>>>>> If we do consider them, the comparison looks like the following.
>>>>>
>>>>> +----------+-----------------+--------------------+
>>>>> |          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>>>>> +----------+-----------------+--------------------+
>>>>> |Legitimate|        X        |         O          |
>>>>> |JDK11     |        X        |         O          |
>>>>> |Hadoop3   |        X        |         O          |
>>>>> |Hadoop2   |        O        |         O          |
>>>>> |Functions |     Baseline    |       More         |
>>>>> |Bug fixes |     Baseline    |       More         |
>>>>> +----------+-----------------+--------------------+
>>>>>
>>>>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves
>>>>> (including Jenkins/GitHubAction/AppVeyor).
>>>>>
>>>>> For me, AS-IS 3.0 is not enough for that. Following your advice,
>>>>> to give more visibility to the whole community,
>>>>>
>>>>> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built
>>>>> distribution
>>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>>>> after the `branch-3.0` branch cut.
>>>>>
>>>>> I know that we have been reluctant to do (1) and (2) because of the burden.
>>>>> But it's time to prepare. Without them, we are going to fall short
>>>>> again and again.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <lian.cs.zju@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>>>> minor release to stabilize Hive 2.3 code paths before retiring the
>>>>>> Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0
>>>>>> is still buggy in terms of JDK 11 support. (BTW, I just found that our
>>>>>> root POM is referring to both Hive 2.3.6 and 2.3.5 at the moment, see here
>>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>>>> and here
>>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>>>> .)
>>>>>>
>>>>>> Again, I'm happy to get rid of ancient legacy dependencies like
>>>>>> Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a
>>>>>> safety net for Spark 3.0. For preview releases, I'm afraid that their
>>>>>> visibility is not good enough for covering such major upgrades.
>>>>>>
>>>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you for the feedback, Hyukjin and Sean.
>>>>>>>
>>>>>>> I proposed `preview-2` for that purpose, but I'm also +1 for doing
>>>>>>> that in 3.1 if we can make a decision to eliminate the illegitimate
>>>>>>> Hive fork reference immediately after the `branch-3.0` cut.
>>>>>>>
>>>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>>>> `hadoop-2.7`.
>>>>>>>
>>>>>>> -
>>>>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>>>
>>>>>>> The way I see this is that it's not a user problem. The Apache Spark
>>>>>>> community hasn't tried to drop the illegitimate Hive fork yet.
>>>>>>> We need to drop it ourselves because we created it and it's our
>>>>>>> bad.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <srowen@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just to clarify, as even I have lost the details over time:
>>>>>>>> hadoop-2.7 works with hive-2.3? it isn't tied to hadoop-3.2?
>>>>>>>> Roughly how much risk is there in using the Hive 1.x fork over
>>>>>>>> Hive 2.x, for end users using Hive via Spark?
>>>>>>>> I don't have a strong opinion, other than sharing the view that we
>>>>>>>> have to dump the Hive 1.x fork at the first opportunity.
>>>>>>>> Question is simply how much risk that entails. Keeping in mind that
>>>>>>>> Spark 3.0 is already something that people understand works
>>>>>>>> differently. We can accept some behavior changes.
>>>>>>>>
>>>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi, All.
>>>>>>>> >
>>>>>>>> > First of all, I want to put this as a policy issue instead of a
>>>>>>>> > technical issue.
>>>>>>>> > Also, this is orthogonal to the `hadoop` version discussion.
>>>>>>>> >
>>>>>>>> > The Apache Spark community has kept (not maintained) the forked
>>>>>>>> > Apache Hive 1.2.1 because there were no other options before.
>>>>>>>> > As we see in SPARK-20202, it's not a desirable situation among
>>>>>>>> > the Apache projects.
>>>>>>>> >
>>>>>>>> >     https://issues.apache.org/jira/browse/SPARK-20202
>>>>>>>> >
>>>>>>>> > Also, please note that we `kept`, not `maintained`, because we
>>>>>>>> > know it's not good.
>>>>>>>> > There were several attempts to update that forked repository
>>>>>>>> > for various reasons (Hadoop 3 support is one example),
>>>>>>>> > but those attempts were also turned down.
>>>>>>>> >
>>>>>>>> > As of Apache Spark 3.0, it seems that we have a new feasible
>>>>>>>> > option: the `hive-2.3` profile. What about moving further in
>>>>>>>> > this direction?
>>>>>>>> >
>>>>>>>> > For example, can we remove the usage of forked `hive` in Apache
>>>>>>>> > Spark 3.0 completely officially? If someone still needs to use
>>>>>>>> > the forked `hive`, we can have a profile `hive-1.2`. Of course,
>>>>>>>> > it should not be a default profile in the community.
>>>>>>>> >
>>>>>>>> > I want to say this is a goal we should achieve someday.
>>>>>>>> > If we don't do anything, nothing will happen. At the least, we
>>>>>>>> > need to prepare for this.
>>>>>>>> > Without any preparation, Spark 3.1+ will be the same.
>>>>>>>> >
>>>>>>>> > Shall we focus on what our problems with Hive 2.3.6 are?
>>>>>>>> > If the only reason is that we haven't used it before, we can
>>>>>>>> > release another `3.0.0-preview` for that.
>>>>>>>> >
>>>>>>>> > Bests,
>>>>>>>> > Dongjoon.
>>>>>>>>
>>>>>>>
