spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
Date Sat, 16 Nov 2019 09:52:27 GMT
Do we have a limit on the number of pre-built distributions? It seems that
this time we need:
1. hadoop 2.7 + hive 1.2
2. hadoop 2.7 + hive 2.3
3. hadoop 3 + hive 2.3

AFAIK we always build with JDK 8 (while keeping it JDK 11 compatible), so we
don't need to add the JDK version to the combination.
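
As a rough illustration of the list above (not part of any Spark build
tooling), the combinations can be enumerated in a few lines of Python. The
names HADOOP, HIVE, EXCLUDED and prebuilt_combinations are hypothetical, and
leaving out Hadoop 3 + Hive 1.2 is an assumption based only on its absence
from the list.

```python
from itertools import product

# Version axes taken from the three combinations listed in the message.
HADOOP = ["2.7", "3"]
HIVE = ["1.2", "2.3"]

# Assumption: Hadoop 3 + Hive 1.2 is skipped only because the message
# does not list it as a pre-built distribution.
EXCLUDED = {("3", "1.2")}


def prebuilt_combinations():
    """Return (hadoop, hive) pairs for the proposed pre-built distributions."""
    return [pair for pair in product(HADOOP, HIVE) if pair not in EXCLUDED]


# JDK is not an axis here: every distribution is built with JDK 8.
for hadoop, hive in prebuilt_combinations():
    print(f"hadoop {hadoop} + hive {hive}")
```

Adding a JDK axis would double the matrix, which is why building everything
with JDK 8 keeps the count at three.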

On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
wrote:

> Thank you for the suggestion.
>
> Having `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> IIRC, originally, it was proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
>
> Also, I'm wondering if you are considering additional pre-built
> distributions and Jenkins jobs.
>
> Bests,
> Dongjoon.
>
>
>
> On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>> Cc Yuming, Steve, and Dongjoon
>>
>> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <lian.cs.zju@gmail.com>
>> wrote:
>>
>>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>>> Hadoop version is quality control. The current hadoop-3.2 profile
>>> covers too many major component upgrades, i.e.:
>>>
>>>    - Hadoop 3.2
>>>    - Hive 2.3
>>>    - JDK 11
>>>
>>> We have already found and fixed some feature and performance regressions
>>> related to these upgrades. I wouldn’t be surprised at all if more
>>> regressions are lurking somewhere. On the other hand, we do want the
>>> community’s help to evaluate and stabilize these new changes.
>>> Following that, I’d like to propose:
>>>
>>>    1. Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>>    Hadoop/Hive/JDK version combinations.
>>>
>>>    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>>    profile, so that users may try out some less risky Hadoop/Hive/JDK
>>>    combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>>>    face potential regressions introduced by the Hadoop 3.2 upgrade.
>>>
>>>    Yuming Wang has already sent out PR #26533
>>>    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop
>>>    2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the
>>>    hive-2.3 profile yet), and the result looks promising: the Kafka
>>>    streaming and Arrow related test failures should be irrelevant to the topic
>>>    discussed here.
>>>
>>>    After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a
>>>    lot of difference between having Hadoop 2.7 or Hadoop 3.2 as the default
>>>    Hadoop version. For users who are still using Hadoop 2.x in production,
>>>    they will have to use a hadoop-provided prebuilt package or build
>>>    Spark 3.0 against their own 2.x version anyway. It does make a difference
>>>    for cloud users who don’t use Hadoop at all, though. And this probably also
>>>    helps to stabilize the Hadoop 3.2 code path faster since our PR builder
>>>    will exercise it regularly.
>>>
>>>    2. Defer Hadoop 2.x upgrade to Spark 3.1+
>>>
>>>    I personally do want to bump our Hadoop 2.x version to 2.9 or even
>>>    2.10. Steve has already stated the benefits very well. My worry here is
>>>    still quality control: Spark 3.0 has already had tons of changes and major
>>>    component version upgrades that are subject to all kinds of known and
>>>    hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
>>>    it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
>>>    to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>>>    next 1 or 2 Spark 3.x releases.
>>>
>>> Cheng
>>>
>>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <koert@tresata.com> wrote:
>>>
>>>> i get that cdh and hdp backport a lot and in that way left 2.7 behind.
>>>> but they kept the public apis stable at the 2.7 level, because thats kind
>>>> of the point. arent those the hadoop apis spark uses?
>>>>
>>>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>>> <stevel@cloudera.com.invalid> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>
>>>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>>>> <stevel@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> It would be really good if the spark distributions shipped with
>>>>>>> later versions of the hadoop artifacts.
>>>>>>>
>>>>>>
>>>>>> I second this. If we need to keep a Hadoop 2.x profile around, why
>>>>>> not make it Hadoop 2.8 or something newer?
>>>>>>
>>>>>
>>>>> go for 2.9
>>>>>
>>>>>>
>>>>>> Koert Kuipers <koert@tresata.com> wrote:
>>>>>>
>>>>>>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2
>>>>>>> profile to latest would probably be an issue for us.
>>>>>>
>>>>>>
>>>>>> When was the last time HDP 2.x bumped their minor version of Hadoop?
>>>>>> Do we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>>>
>>>>>
>>>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A
>>>>> really large proportion of the later branch-2 patches are backported. 2.7
>>>>> was left behind a long time ago.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
