spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)
Date Thu, 21 Nov 2019 00:07:46 GMT
Sean, thanks for the corner cases you listed. They make a lot of sense. I'm
now inclined to have Hive 2.3 as the default version.

Dongjoon, apologies if I didn't make it clear before. What concerned me
initially was only the following part:

> can we remove the usage of forked `hive` in Apache Spark 3.0 completely
> officially?

So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
Thanks for starting the discussion!

On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun <dongjoon.hyun@gmail.com>
wrote:

> Yes. Right. That's the situation we are hitting and the result I expected.
> We need to change our default to Hive 2 in the POM.
>
> Dongjoon.
>
>
> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen <srowen@gmail.com> wrote:
>
>> Yes, good point. A user would get whatever the POM says without
>> profiles enabled so it matters.
>>
>> Playing it out, an app _should_ compile with the Spark dependency
>> marked 'provided'. In that case the app that is spark-submit-ted is
>> agnostic to the Hive dependency as the only one that matters is what's
>> on the cluster. Right? We don't leak through the Hive API in the Spark
>> API. And yes it's then up to the cluster to provide whatever version
>> it wants. Vendors will have made a specific version choice when
>> building their distro one way or the other.
>>
>> If you run a Spark cluster yourself, you're using the binary distro,
>> and we're already talking about also publishing a binary distro with
>> this variation, so that's not the issue.
>>
>> The corner cases where it might matter are:
>>
>> - I unintentionally package Spark in the app and by default pull in
>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>> causes other problems
>> - I run tests locally in my project, which will pull in a default
>> version of Hive defined by the POM
>>
>> Double-checking, is that right? If so it kind of implies it doesn't
>> matter. Which is an argument either way about what's the default. I
>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>> something about the implication?
>>
>> (That fork will stay published forever anyway, that's not an issue per
>> se.)
>>
>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>> wrote:
>> > Sean, our published POM is pointing to and advertising the illegitimate
>> > Hive 1.2 fork as a compile dependency.
>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>> > like that?
>> > If someone wants to use that illegitimate Hive 1.2 fork, let them
>> > override it. We are unable to delete that illegitimate Hive 1.2 fork.
>> > Those artifacts will be orphans.
>> >
>>
>
