spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
Date Wed, 20 Nov 2019 04:56:26 GMT
> I don't think the default Hadoop version matters except for the
spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2
profile.

What do you mean by "only meaningful under the hadoop-3.2 profile"?

On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian <lian.cs.zju@gmail.com> wrote:

> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
> wrote:
>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <felixcheung_m@hotmail.com>
>> wrote:
>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>> It is old and rather buggy; and It’s been *years*
>>>
>>> I think we should decouple hive change from everything else if people
>>> are concerned?
>>>
>>> ------------------------------
>>> *From:* Steve Loughran <stevel@cloudera.com.INVALID>
>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>> *To:* Cheng Lian <lian.cs.zju@gmail.com>
>>> *Cc:* Sean Owen <srowen@gmail.com>; Wenchen Fan <cloud0fan@gmail.com>;
>>> Dongjoon Hyun <dongjoon.hyun@gmail.com>; dev <dev@spark.apache.org>;
>>> Yuming Wang <wgyumg@gmail.com>
>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which
>>> spark has historically bundled (the org.spark-project one) is an orphan
>>> project put together to deal with Hive's shading issues and a source of
>>> unhappiness in the Hive project. What ever get shipped should do its best
>>> to avoid including that file.
>>>
>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
>>> move from a risk minimisation perspective. If something has broken then it
>>> is you can start with the assumption that it is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the hadoop / hive dependencies those teams will
>>> inevitably ignore filed bug reports for the same reason spark team will
>>> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned: people should
>>> move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts
>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
>>> 3.x. That way people doing things with their own projects will get
>>> up-to-date dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>>> the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <lian.cs.zju@gmail.com>
>>> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>>> seemed risky, and therefore we only introduced Hive 2.3 under the
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>>> upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <srowen@gmail.com> wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>> >
>>> > Do we have a limitation on the number of pre-built distributions?
>>> Seems this time we need
>>> > 1. hadoop 2.7 + hive 1.2
>>> > 2. hadoop 2.7 + hive 2.3
>>> > 3. hadoop 3 + hive 2.3
>>> >
>>> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
>>> don't need to add JDK version to the combination.
>>> >
>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>> wrote:
>>> >>
>>> >> Thank you for suggestion.
>>> >>
>>> >> Having `hive-2.3` profile sounds good to me because it's orthogonal
>>> to Hadoop 3.
>>> >> IIRC, originally, it was proposed in that way, but we put it under
>>> `hadoop-3.2` to avoid adding new profiles at that time.
>>> >>
>>> >> And, I'm wondering if you are considering additional pre-built
>>> distribution and Jenkins jobs.
>>> >>
>>> >> Bests,
>>> >> Dongjoon.
>>> >>
>>>
>>>

Mime
View raw message