spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
Date Sat, 16 Nov 2019 08:05:32 GMT
Thank you for the suggestion.

Having a `hive-2.3` profile sounds good to me because it's orthogonal to
Hadoop 3.
IIRC, originally, it was proposed in that way, but we put it under
`hadoop-3.2` to avoid adding new profiles at that time.

Also, I'm wondering if you are considering additional pre-built distributions
and Jenkins jobs.

Bests,
Dongjoon.
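
For readers following along, the decoupling discussed below would roughly
correspond to Maven invocations like these (a sketch only; the `hive-2.3`
profile name is the one proposed in this thread, and the exact flags are
illustrative):

```shell
# Hadoop 2.7 + Hive 2.3: avoids the Hadoop 3.2 upgrade entirely
# (the hive-2.3 profile is proposed in this thread, not yet merged)
./build/mvn -Phadoop-2.7 -Phive-2.3 -DskipTests clean package

# Hadoop 3.2 + Hive 2.3: what the current hadoop-3.2 profile bundles together
./build/mvn -Phadoop-3.2 -Phive-2.3 -DskipTests clean package

# "Hadoop-free" build for users on a vendor Hadoop distribution; at runtime,
# point Spark at the cluster's own Hadoop jars:
./build/mvn -Phadoop-provided -DskipTests clean package
# export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```

The `hadoop-provided` pattern is what Cheng refers to below as the
"hadoop-provided prebuilt package" option for users staying on Hadoop 2.x.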



On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian <lian.cs.zju@gmail.com> wrote:

> Cc Yuming, Steve, and Dongjoon
>
> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>> Hadoop version is quality control. The current hadoop-3.2 profile covers
>> too many major component upgrades, i.e.:
>>
>>    - Hadoop 3.2
>>    - Hive 2.3
>>    - JDK 11
>>
>> We have already found and fixed some feature and performance regressions
>> related to these upgrades. Empirically, I’m not surprised at all if more
>> regressions are lurking somewhere. On the other hand, we do want the
>> community's help to evaluate and stabilize these new changes.
>> Following that, I’d like to propose:
>>
>>    1.
>>
>>    Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>    Hadoop/Hive/JDK version combinations.
>>
>>    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>    profile, so that users may try out some less risky Hadoop/Hive/JDK
>>    combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>>    face potential regressions introduced by the Hadoop 3.2 upgrade.
>>
>>    Yuming Wang has already sent out PR #26533
>>    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop
>>    2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the
>>    hive-2.3 profile yet), and the result looks promising: the Kafka
>>    streaming and Arrow related test failures should be irrelevant to the topic
>>    discussed here.
>>
>>    After decoupling Hive 2.3 from Hadoop 3.2, I don’t think it makes
>>    much difference whether Hadoop 2.7 or Hadoop 3.2 is the default
>>    Hadoop version. Users who are still running Hadoop 2.x in production
>>    will have to use a hadoop-provided prebuilt package or build
>>    Spark 3.0 against their own 2.x version anyway. It does make a difference
>>    for cloud users who don’t use Hadoop at all, though. And this probably also
>>    helps to stabilize the Hadoop 3.2 code path faster since our PR builder
>>    will exercise it regularly.
>>    2.
>>
>>    Defer Hadoop 2.x upgrade to Spark 3.1+
>>
>>    I personally do want to bump our Hadoop 2.x version to 2.9 or even
>>    2.10. Steve has already stated the benefits very well. My worry here is
>>    still quality control: Spark 3.0 has already had tons of changes and major
>>    component version upgrades that are subject to all kinds of known and
>>    hidden regressions. Having Hadoop 2.7 there provides us with a safety net, since
>>    it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
>>    to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>>    next 1 or 2 Spark 3.x releases.
>>
>> Cheng
>>
>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <koert@tresata.com> wrote:
>>
>>> i get that cdh and hdp backport a lot and in that way left 2.7 behind.
>>> but they kept the public apis stable at the 2.7 level, because that's kind
>>> of the point. aren't those the hadoop apis spark uses?
>>>
>>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>> <stevel@cloudera.com.invalid> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>>>> nicholas.chammas@gmail.com> wrote:
>>>>
>>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>>> <stevel@cloudera.com.invalid> wrote:
>>>>>
>>>>>> It would be really good if the spark distributions shipped with later
>>>>>> versions of the hadoop artifacts.
>>>>>>
>>>>>
>>>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>>>> make it Hadoop 2.8 or something newer?
>>>>>
>>>>
>>>> go for 2.9
>>>>
>>>>>
>>>>> Koert Kuipers <koert@tresata.com> wrote:
>>>>>
>>>>>> given that the latest hdp 2.x is still hadoop 2.7, bumping the hadoop 2
>>>>>> profile to latest would probably be an issue for us.
>>>>>
>>>>>
>>>>> When was the last time HDP 2.x bumped their minor version of Hadoop?
>>>>> Do we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>>
>>>>
>>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>>>> large proportion of the later branch-2 patches are backported. 2.7 was left
>>>> behind a long time ago.
>>>>
>>>>
>>>>
>>>>
>>>
