sqoop-dev mailing list archives

From Dániel Vörös <daniel.vo...@gmail.com>
Subject Re: Release to support Hadoop 3
Date Thu, 10 May 2018 13:27:49 GMT
Dear All,

After Bogi created the 3.0.0 version in Jira, I applied it to a
couple of tickets that don't make sense on the 1.x line (without
Hadoop3/Hive3).

However, as Bogi has mentioned in her previous email, it probably doesn't
make sense to work on a 1.5 release in parallel with 3.0.0. How would you
feel if we were to move all 1.5 issues [1] to 3.0.0?

In the meantime I've experimented with running Sqoop 1.4.7 against Hadoop
3.1.0, and I'm planning to do the opposite, running Sqoop 3.0.0-SNAPSHOT
against Hadoop 2.x. That way we'd be able to better assess Attila's
question about backward compatibility. Please note that the hard part will
be the Hive integration, I'm afraid, and until there is a Hive 3.0 release
it's hard to test. If anyone's interested in this topic, check out [2].

Regards,
Daniel

[1]
https://issues.apache.org/jira/issues?jql=project%20%3D%20SQOOP%20and%20fixVersion%20%3D%201.5.0%20and%20resolutionDate%20is%20not%20%20empty%20order%20by%20resolutiondate%20desc
[2] https://github.com/dvoros/docker-sqoop

On Mon, Apr 16, 2018 at 2:20 PM Szabolcs Vasas <vasas@apache.org> wrote:

> Hi All,
>
> Sqoop NG/Sqoop 3:
> As far as I remember Sqoop NG was an alternative name suggested for Sqoop 2
> which has a totally different architecture than Sqoop 1. I would not use it
> now, since in this release we do not include changes affecting the
> architecture, only bumps to the versions of the dependencies. However, since
> dependencies are bumped to new major releases, I think we should also
> change the major version number of Sqoop.
>
> Hadoop 2 support:
> I agree with Daniel that we should not introduce extra complexity to
> support Hadoop 2 as well. However even if we don't support Hadoop 2 in our
> next major Sqoop release some features which do not require Hadoop 3 could
> be backported by the vendors to their earlier releases as well. I think
> introducing a 1.x branch upstream would lead to an increased complexity of
> committing bug fixes and I am not sure the community wants to make a
> release in Sqoop 1.x branch. Even if at some point somebody wants to do
> this they could cut the branch and cherry-pick the necessary bug fixes
> right before the release.
>
> Kite removal:
> I agree that this is quite a complex task on its own but we can't bump the
> Hadoop/Hive/HBase dependencies without deciding what to do with Kite. One
> option is to bump these dependencies in Kite too, create a new Kite release
> and bump Sqoop's Kite dependency to this new release. Another option is to
> get rid of the Kite dependency before we bump Hadoop/Hive/HBase version. In
> my opinion the latter one makes more sense since we wanted to eliminate the
> Kite dependency anyway and the Kite project seems to be dead so bumping the
> dependencies, making the necessary code changes, fixing tests and creating
> the release might be overkill.
>
> Szabolcs
>
> On Mon, Apr 16, 2018 at 11:50 AM, Dániel Vörös <daniel.voros@gmail.com>
> wrote:
>
> > Hi All,
> >
> > I believe we're all on the same page on removing Kite, so I've opened
> > SQOOP-3313 to track that. @Attila I'm glad to see your interest in the
> > ORC part. It would be highly appreciated if you could take a look at this
> > review request[1].
> >
> > I'm not that familiar with Flume, but it seems they've added NG after
> > architectural changes and released FlumeNG 1.0 after Flume 0.9.4 [2]. Even
> > if we go with NG, I'd suggest calling it 3.0, to avoid confusion with
> > earlier releases.
> >
> > I think the biggest part of keeping Hadoop 2 (and previous versions of
> > downstream projects like Hive) supported would be testing against those. It
> > would also require at least another build profile to build against them,
> > and probably another layer of abstraction in the code (like Hadoop shims in
> > Hive).
> > Not sure about vendors, but I think they're usually not adding new features
> > to older release lines. In my opinion we should branch off from current
> > trunk to track the 1.x release line (where we keep supporting Hadoop 2) and
> > keep adding bugfixes there, but add new features to trunk only and don't
> > worry about Hadoop 2 there.
> >
> > I agree with Attila on the dependencies. We shouldn't release based on
> > non-final releases. We might bump the dependencies to some alpha/beta
> > during development, but don't forget to move to the final version in the
> > end.
> >
> > +1 for Bogi as release manager.
> >
> > Regards,
> > Daniel
> >
> > [1] https://reviews.apache.org/r/66548/
> > [2] https://blogs.apache.org/flume/entry/flume_ng_architecture
> >
> > On Fri, Apr 13, 2018 at 5:24 PM Szabó Attila <maugli@inf.elte.hu> wrote:
> >
> > >
> > >
> > > Hello everyone,
> > >
> > >
> > > I'd like to also attach my thoughts:
> > >
> > >
> > > New Sqoop version: The last time I had the chance to talk about this with
> > > some of the PMC members (e.g. Jarcec, Kate), we were in favor of creating
> > > Sqoop-NG (NG == Next Generation), much like what the Flume community did
> > > (and AFAIK from Mike Percy it's been quite a successful move from their
> > > POV). Don't get me wrong, I'm totally NOT against 3.0, though IMHO
> > > Sqoop-NG 1.0 would be a better choice.
> > >
> > >
> > > Kite: I would totally split this effort into two subtasks. First I would
> > > get in contact with the Parquet team, and would create a KITE independent
> > > execution path in Sqoop for the Parquet backed tables (Hive/Impala/etc.).
> > > As a part of this effort I would also add direct support for the ORC
> > > format (in the past few years I've found it very useful in several
> > > different situations, and usually it's quite inconvenient that Sqoop does
> > > not support it "out of the box").
> > >
> > > As the second subtask I would start to remove every KITE based dependency
> > > (but according to my gut feeling it could break the codebase in too many
> > > places, and might not be that EZ to succeed on that front).
> > >
> > >
> > > Hadoop 2:
> > >
> > > Could anyone please highlight for me what the pros/cons would be on this
> > > front? AFAIK several vendors (including Cloudera, Hortonworks, MapR, EMR,
> > > etc.) are still supporting Hadoop 2, and according to my best knowledge
> > > most of the userbase is tied to their releases, so I'd like to
> > > provide the chance for those users to use the newest features of Sqoop,
> > > thus I would vote for keeping compatibility for a bit more time/versions.
> > >
> > >
> > > Dependencies:
> > >
> > > I'd like to cast my very direct and LOUD vote against any alpha
> > > dependencies (including HBase or anything else!). IMHO Sqoop is already a
> > > stable component of the Apache Foundation, and the users can depend on it,
> > > thus I'd like to avoid any kind of "immature" dependency related issues. Of
> > > course this is also just my solo opinion, but as a community I think we
> > > must not undermine our stability.
> > >
> > > On the other fronts I totally agree and +1 with the planned efforts,
> > >
> > > Best regards,
> > > Attila
> > >
> > > ________________________________
> > > From: Szabolcs Vasas <vasas@apache.org>
> > > Sent: Friday, April 13, 2018 3:43 PM
> > > To: dev@sqoop.apache.org
> > > Subject: Re: Release to support Hadoop 3
> > >
> > > Hi all,
> > >
> > > I also think that completely eliminating the Kite dependency from Sqoop
> > > would be the easiest way of going forward. I will try to analyze this topic
> > > a bit more next week and come up with subtasks so we could work on it in
> > > parallel potentially.
> > >
> > > I am happy with the Sqoop 3.0 scope proposal too and Bogi being the release
> > > manager of it.
> > >
> > > Szabolcs
> > >
> > >
> > > On Fri, Apr 13, 2018 at 2:37 PM, Boglarka Egyed <bogi@apache.org> wrote:
> > >
> > > > Hi Daniel et al,
> > > >
> > > > Thanks for bringing up this topic and the detailed status update.
> > > >
> > > > I am sharing my thoughts point by point, please find them below.
> > > >
> > > > > 1) How to get a new Kite release? Maybe we should remove the Kite
> > > > > dependency altogether (as Szabolcs hinted in comments of SQOOP-3171)?
> > > >
> > > >
> > > > I think making a new Kite release would be a huge effort, as it would
> > > > require upgrading the versions, making the necessary code modifications,
> > > > testing it thoroughly, etc., and then making the release itself.
> > > > Meanwhile, Kite is a very passively maintained tool with minimal
> > > > activity, so it would definitely take a lot of effort to get it done. It
> > > > would have a dependency on the Solr community too, as the Morphlines
> > > > module of Kite is heavily used and somewhat actively developed by them.
> > > > Also, there is indeed a shorter/longer term goal to get rid of the Kite
> > > > dependency in Sqoop entirely, i.e. all release efforts would become
> > > > throw-away very soon.
> > > >
> > > > Focusing on the Kite removal seems to be more reasonable to me. However
> > > > it would be great to see an estimation regarding this effort, @Szabolcs
> > > > could you maybe share your thoughts on this?
> > > >
> > > > > 2) Should we drop support for Hadoop 2?
> > > > >
> > > >
> > > > I think we can drop support for Hadoop 2 especially if we use
> > > > straightforward versioning with the new release.
> > > >
> > > >
> > > > > 3) What version number should we use? To avoid confusion with Sqoop2
> > > > > I'd go with 3.0.
> > > > >
> > > >
> > > > I like this idea, +1 for making a 3.0 release containing these changes.
> > > >
> > > >
> > > > > 4) Does (should?) this affect the 1.5 release?
> > > >
> > > >
> > > > I think the answer is yes. Currently the following breaking changes are
> > > > on the horizon which could be part of a next Sqoop release:
> > > > * com.cloudera package removal (done)
> > > > * Gradle introduction (in progress)
> > > > * Hadoop/Hive/HBase version upgrade (in progress)
> > > > * Kite deprecation/removal (planned)
> > > > * Bump Java version to 8 (planned)
> > > >
> > > > Looking at this list I would say that making a Sqoop 1.5 release
> > > > containing only the com.cloudera package removal, the Gradle introduction
> > > > and the Java version bump would mean a somewhat small and irrelevant
> > > > scope from a user perspective, so maybe having two releases (1.5 and 3.0)
> > > > would be a little bit of an overkill. I would instead suggest going with
> > > > a Sqoop 3.0 release containing all the changes listed above. What do you
> > > > think?
> > > >
> > > > To summarize, I currently see the following dependencies for the next
> > > > Sqoop release:
> > > > * Finishing up the Gradle patch
> > > > * Hive 3 release
> > > > * Kite removal - this could be the next common effort in the community
> > > >
> > > > Anyhow I would be happy to take the Release Manager role for the next
> > > > release, please let me know if everyone would be OK with that.
> > > >
> > > > I am looking forward to seeing others' thoughts on this too.
> > > >
> > > > Many thanks,
> > > > Bogi
> > > >
> > > > On Thu, Apr 12, 2018 at 5:17 PM, Dániel Vörös <daniel.voros@gmail.com>
> > > > wrote:
> > > >
> > > > > Dear All,
> > > > >
> > > > > After some development towards supporting Hadoop 3 (and latest version
> > > > > of downstream components) I'd like to summarize the current state of
> > > > > the upgrade and start the conversation about releasing a new version
> > > > > of Sqoop with Hadoop 3 support.
> > > > >
> > > > > Here's what happened so far:
> > > > >  - Upgraded Hadoop dependency to 3.0.0
> > > > >  - Hive had to be upgraded, since old Hive didn't work with Hadoop 3.
> > > > >  - HBase had to be upgraded since Hive 3 depends on HBase 2(alpha)
> > > > >  - Dealt with a bunch of minor issues like changed Hadoop configuration
> > > > >    names and different packaging of Maven artifacts.
> > > > >
> > > > > For details please refer to this ticket and the attached review request:
> > > > > https://issues.apache.org/jira/browse/SQOOP-3305
> > > > >
> > > > > Remaining work:
> > > > >  - Parquet importing doesn't work. It was broken by a
> > > > >    standalone-metastore change in Hive and fixing would require a new
> > > > >    Kite version to be built against Hive 3.
> > > > >  - Hive 3 is going to enable ACID tables by default. We should support
> > > > >    importing into these. Details:
> > > > >    https://issues.apache.org/jira/browse/SQOOP-3311
> > > > >
> > > > > Other blocking issues:
> > > > >  - There's no Hive 3 release (no alpha/beta) yet.
> > > > >
> > > > > I'd like to kindly ask you all to share any other tasks/issues you know
> > > > > of that we should address to support the latest versions. Also, there
> > > > > are a couple open questions:
> > > > >  1) How to get a new Kite release? Maybe we should remove the Kite
> > > > >     dependency altogether (as Szabolcs hinted in comments of SQOOP-3171)?
> > > > >  2) Should we drop support for Hadoop 2?
> > > > >  3) What version number should we use? To avoid confusion with Sqoop2
> > > > >     I'd go with 3.0.
> > > > >  4) Does (should?) this affect the 1.5 release?
> > > > >
> > > > > Regards,
> > > > > Daniel
> > > > >
> > > >
> > >
> >
>
