spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)
Date Thu, 21 Nov 2019 00:53:49 GMT
Thank you all.

I'll try to create a JIRA issue and a PR for that.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 4:08 PM Cheng Lian <lian.cs.zju@gmail.com> wrote:

> Sean, thanks for the corner cases you listed. They make a lot of sense.
> I am now inclined to have Hive 2.3 as the default version.
>
> Dongjoon, apologies if I didn't make it clear before. What made me
> concerned initially was only the following part:
>
> > can we remove the usage of forked `hive` in Apache Spark 3.0 completely
> > officially?
>
> So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
> profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
> Thanks for starting the discussion!
>
> On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun <dongjoon.hyun@gmail.com>
> wrote:
>
>> Yes. Right. That's the situation we are hitting and the result I expected.
>> We need to change our default to Hive 2 in the POM.
>>
>> Dongjoon.
>>
>>
>> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen <srowen@gmail.com> wrote:
>>
>>> Yes, good point. A user would get whatever the POM says without
>>> profiles enabled so it matters.
>>>
>>> Playing it out, an app _should_ compile with the Spark dependency
>>> marked 'provided'. In that case the app that is spark-submit-ted is
>>> agnostic to the Hive dependency as the only one that matters is what's
>>> on the cluster. Right? We don't leak the Hive API through the Spark
>>> API. And yes it's then up to the cluster to provide whatever version
>>> it wants. Vendors will have made a specific version choice when
>>> building their distro one way or the other.
>>>
>>> If you run a Spark cluster yourself, you're using the binary distro,
>>> and we're already talking about also publishing a binary distro with
>>> this variation, so that's not the issue.
>>>
>>> The corner cases where it might matter are:
>>>
>>> - I unintentionally package Spark in the app and by default pull in
>>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>>> causes other problems
>>> - I run tests locally in my project, which will pull in a default
>>> version of Hive defined by the POM
>>>
>>> Double-checking, is that right? If so, it kind of implies it doesn't
>>> matter. Which is an argument either way about what's the default. I
>>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>>> something about the implication?
>>>
>>> (That fork will stay published forever anyway, that's not an issue per
>>> se.)
>>>
>>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>> wrote:
>>> > Sean, our published POM is pointing to and advertising the illegitimate
>>> > Hive 1.2 fork as a compile dependency.
>>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>>> > like that?
>>> > If someone wants to use that illegitimate Hive 1.2 fork, let them
>>> > override it. We are unable to delete those illegitimate Hive 1.2 fork
>>> > artifacts. Those artifacts will be orphans.
>>> >
>>>
>>
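
For readers following the thread, Sean's point about compiling an application with the Spark dependency marked 'provided' can be sketched as an sbt build definition. This is only an illustration under stated assumptions: the thread does not name a build tool, and the version number and the choice of the `spark-sql` and `spark-hive` modules are hypothetical.

```scala
// build.sbt (illustrative sketch; not taken from the thread)
// Assumption: the version below is a placeholder for whichever Spark release is targeted.
val sparkVersion = "3.0.0"

libraryDependencies ++= Seq(
  // Provided scope: compile against Spark, but do not bundle it in the application jar.
  // At run time the cluster's Spark distribution supplies Spark and its Hive classes.
  "org.apache.spark" %% "spark-sql"  % sparkVersion % Provided,
  "org.apache.spark" %% "spark-hive" % sparkVersion % Provided
)
```

With this setup the packaged application carries no Hive classes of its own, which matches the first corner case only arising when Spark is packaged by mistake. Tests run locally still resolve Spark's transitive Hive dependency from the published POM, which is the second corner case and the reason the POM default matters.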
