sqoop-dev mailing list archives

From Abraham Elmahrek <...@cloudera.com>
Subject Re: Hadoop as Compile time dependency in Sqoop2
Date Sat, 13 Dec 2014 01:55:49 GMT
Hey Richard,

I think Jarcec is agreeing here and that it's worth trying out. Let's move
forward with the current design?

-Abe

On Thu, Dec 11, 2014 at 9:13 PM, Zhou, Richard <richard.zhou@intel.com>
wrote:
>
>  Hi Jarcec & Abe:
>
> Thank you for your nice clarification. I have a few thoughts about it.
>
>
>
> 1.       As the Hadoop dependency is “provided” in the Sqoop server, and
> the real classpath is set in catalina.properties, how do we avoid
> compatibility mistakes? Say the Sqoop server builds successfully with
> Hadoop 2.5.1, as listed in the root pom.xml, while the real cluster runs
> Hadoop 2.5.0. Even though there are only minor changes between these two
> minor releases, some unexpected exceptions could still occur.
>
> 2.       If “compile” is used on both the client and server side, wire
> compatibility is ensured for the authentication communication between
> client and server. But another compatibility issue surfaces: the Sqoop
> server then depends on different versions of Hadoop-common - 2.6.0 from
> “compile”, and 2.5.0 (the real cluster version) from the classpath in
> catalina.properties.
>
> 3.       As Abe said, if we use “compile” in one place (the client side)
> and “provided” in another (the server side), I agree that there will be
> some wire compatibility issues.
>
> 4.       The best solution to resolve all compatibility issues is to use
> “provided” on both the client and server side, with the Hadoop-common lib
> taken from the real cluster’s classpath, which must then be 2.6.0 or
> later. However, it is impossible to make all users run Hadoop 2.6.0 or
> later.
>
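> To make the two scopes concrete, here is a minimal sketch (standard Maven
> semantics, using the version numbers discussed in this thread) of the two
> choices for hadoop-common in a pom.xml:

```xml
<!-- "provided": on the compile classpath only; not packaged, so the server
     picks up the cluster's own hadoop-common via catalina.properties -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.5.1</version>
  <scope>provided</scope>
</dependency>

<!-- "compile" (the default scope): the jar is bundled into the
     distribution, pinning the Hadoop version that Sqoop ships with -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.6.0</version>
</dependency>
```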
>
>
> So I am wondering whether it is a little rushed to support delegation
> tokens currently, since the feature only arrived in Hadoop 2.6.0, while
> Sqoop only supports 2.5.1. Even though Hadoop 2.6.0 will be supported
> sooner or later, Sqoop must support Hadoop 2.5.1 and earlier for a long
> time as well. Maybe we should revisit delegation token support in the
> Hadoop 3.* timeframe, when delegation tokens should be widely supported.
> As for the Kerberos support task (SQOOP-1525), it could be finished with
> the doAs function completed. Actually this code is ready, and the reason I
> have not uploaded it for review is that I think delegation tokens are a
> better solution to handle this; there is no need to commit the doAs code
> and then rewrite it with delegation tokens. Delegation token support could
> then be treated as an improvement to the Kerberos support.
>
>
>
> Richard
>
>
>
> *From:* Abraham Elmahrek [mailto:abe@cloudera.com]
> *Sent:* Friday, December 12, 2014 8:14 AM
> *To:* dev@sqoop.apache.org
> *Cc:* Zhou, Richard
>
> *Subject:* Re: Hadoop as Compile time dependency in Sqoop2
>
>
>
> I'll have to do a bit of experimentation to better understand packaging
> and dependencies. If we make hadoop-common a compile time requirement
> conditionally in sqoop-core, would this affect the classpath of the other
> components in the server? In the DependencyManagement section of the root
> pom, would it still be marked as provided?
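> One way this could work - a sketch of standard Maven behavior, not the
> actual Sqoop poms: the root pom's dependencyManagement keeps the default
> scope, and sqoop-core overrides it locally when declaring the dependency:

```xml
<!-- root pom.xml: managed default for all modules -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- sqoop-core/pom.xml: explicit scope overrides the managed one,
     so the jar is packaged with the server -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <scope>compile</scope>
</dependency>
```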
>
>
>
> -Abe
>
>
>
> On Thu, Dec 11, 2014 at 5:50 PM, Jarek Jarcec Cecho <jarcec@apache.org>
> wrote:
>
> Got it, so the proposal is really to ship Hadoop libraries as part of our
> distribution (tarball) and not let users configure Sqoop against existing
> ones. I personally don’t feel entirely comfortable doing so, as I’m afraid
> that a lot of trouble will pop up along the way (given my experience), but
> I’m open to giving it a try. Just to be on the same page, we want to
> package Hadoop-common with the server only, right? So I’m assuming that
> the “compile” dependency will be on sqoop-core rather than sqoop-common
> (which is shared between client and server).
>
> Jarcec
>
>
> > On Dec 11, 2014, at 3:34 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
> >
> > Jarcec,
> >
> > I believe that providing delegation token support requires using a class
> > on the server side that is only available in hadoop-common as of Hadoop
> > 2.6.0 [1]. This seems like reason enough to change from "provided" to
> > "compile", given the feature may not exist in previous versions of
> > Hadoop 2.
> >
> > Also, requiring that Sqoop2 must be used with Hadoop 2.6.0 or newer
> > doesn't seem like a great idea. It delegates Hadoop version management
> > to the users of Sqoop2, where it might be better handled by the devs.
> >
> > 1. https://issues.apache.org/jira/browse/HADOOP-11083
> >
> > On Thu, Dec 11, 2014 at 4:50 PM, Jarek Jarcec Cecho <jarcec@apache.org>
> > wrote:
> >>
> >> Nope, not at all Abe. I also feel that client and server changes should
> >> be discussed separately, as there are different reasons/concerns for
> >> why or why not to introduce Hadoop dependencies there.
> >>
> >> For the server side and for the security portion, I feel that we had a
> >> good discussion with Richard a while back, and I no longer have
> >> concerns about using those APIs. I’ll advise caution nevertheless. What
> >> are we trying to achieve by changing the scope from “provided” to
> >> “compile” here? To my best knowledge [1], the difference is only that
> >> “provided” means that the dependency is not retrieved and stored in the
> >> resulting package, and that users have to add it manually after
> >> installation. I’m not immediately seeing any impact on the code though.
> >>
> >> Jarcec
> >>
> >> Links:
> >> 1:
> >>
> http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
> >>
> >>> On Dec 11, 2014, at 8:41 AM, Abraham Elmahrek <abe@cloudera.com>
> wrote:
> >>>
> >>> Jarcec,
> >>>
> >>> Sorry to butt in... you make a good point on the client side. Would
> >>> you mind if we discussed the server side a bit? Re-using the same
> >>> mechanism on the server side does require "compile" scope dependencies
> >>> on Hadoop. Would that be ok? Are the concerns mainly around the client?
> >>>
> >>> -Abe
> >>>
> >>> On Thu, Dec 11, 2014 at 10:30 AM, Jarek Jarcec Cecho <
> jarcec@apache.org>
> >>> wrote:
> >>>
> >>>> Got it Richard, thank you very much for the nice summary! I’m
> >>>> wondering: what is the use case for delegation tokens on the client
> >>>> side? Is it to support integration with Oozie?
> >>>>
> >>>> I do know that Beeline depends on Hadoop-common, and that is actually
> >>>> a very good example. I’ve seen a sufficient number of users struggling
> >>>> with this dependency - using various workarounds for the classpath
> >>>> issue, having to copy over Hadoop configuration files from the real
> >>>> cluster (because otherwise a portion of the security didn’t work at
> >>>> all, something with the auth_to_local rules), and a lot more. That is
> >>>> why I’m advising being careful here.
> >>>>
> >>>> Jarcec
> >>>>
> >>>>> On Dec 11, 2014, at 12:17 AM, Zhou, Richard <richard.zhou@intel.com>
> >>>> wrote:
> >>>>>
> >>>>> Hi Jarcec:
> >>>>> Thank you very much for your clarification about the history.
> >>>>>
> >>>>> The root cause for why we want to change "provided" to "compile" is
> >>>> to implement "Delegation Token Support" [1], review board [2]. The
> >>>> status in Hadoop is shown below.
> >>>>> Hadoop 2.5.1 or before: all classes used to implement Kerberos
> >>>> support are in the Hadoop-auth component, which depends on only a few
> >>>> non-Hadoop related libs. And it is added on the Sqoop client side
> >>>> (shell component [3]) as "compile", as we agreed before.
> >>>>> Hadoop 2.6.0: There is a refactor to support delegation tokens in
> >>>> Hadoop [4]. Most components in Hadoop, such as RM, Httpfs and Kms,
> >>>> have rewritten their authentication mechanism to use delegation
> >>>> tokens. However, all delegation token related classes are in
> >>>> Hadoop-common instead of Hadoop-auth, because they use the
> >>>> UserGroupInformation class.
> >>>>>
> >>>>> So if Sqoop needs to support delegation tokens, it has to include
> >>>> the Hadoop-common lib, because I believe that copying code is an
> >>>> unacceptable solution. Even with Hadoop shims, which are a good way to
> >>>> support different versions of Hadoop (I am +1 on writing a Hadoop shim
> >>>> layer in Sqoop like Pig, Hive etc.), Hadoop-common is still a
> >>>> dependency. For example, the client side (Beeline) in Hive depends on
> >>>> the Hadoop-common lib [5]. So I don't think it is a big problem to add
> >>>> Hadoop-common in.
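> >>>> A shim layer of the kind mentioned (Pig/Hive style) might be sketched
> >>>> roughly like this; the interface and class names are illustrative
> >>>> only, not actual Sqoop or Hadoop code:

```java
// Hypothetical sketch of a version-dispatching shim layer; the names
// AuthShim, Hadoop25Shim, Hadoop26Shim and ShimFactory are illustrative.
interface AuthShim {
    // Which authentication mechanism this shim would use.
    String tokenKind();
}

// Pre-2.6.0 Hadoop: no delegation token support; fall back to Kerberos/doAs.
class Hadoop25Shim implements AuthShim {
    public String tokenKind() { return "kerberos-doAs"; }
}

// Hadoop 2.6.0+: delegation token classes are available in hadoop-common.
class Hadoop26Shim implements AuthShim {
    public String tokenKind() { return "delegation-token"; }
}

public class ShimFactory {
    // Pick a shim from a Hadoop version string such as "2.6.0".
    public static AuthShim forVersion(String version) {
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        if (major > 2 || (major == 2 && minor >= 6)) {
            return new Hadoop26Shim();
        }
        return new Hadoop25Shim();
    }

    public static void main(String[] args) {
        System.out.println(forVersion("2.5.1").tokenKind()); // kerberos-doAs
        System.out.println(forVersion("2.6.0").tokenKind()); // delegation-token
    }
}
```

> >>>> A real implementation would key the dispatch off something like
> >>>> Hadoop's VersionInfo.getVersion() at runtime rather than a
> >>>> hard-coded string.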
> >>>>>
> >>>>> Additionally, I agree with Abe that wire compatibility is another
> >>>> reason to change "provided" to "compile", since the interface is
> >>>> marked "Unstable". There will be potential problems in the future.
> >>>>>
> >>>>> So I prefer to add the Hadoop-common lib as "compile" to make
> >>>> "Delegation Token Support" happen.
> >>>>>
> >>>>> Add intel-sqoop@cloudera.org.
> >>>>>
> >>>>> Links:
> >>>>> 1: https://issues.apache.org/jira/browse/SQOOP-1776
> >>>>> 2: https://reviews.apache.org/r/28795/
> >>>>> 3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75
> >>>>> 4: https://issues.apache.org/jira/browse/HADOOP-10771
> >>>>> 5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133
> >>>>>
> >>>>> Richard
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Jarek Jarcec Cecho [mailto:jarcec@gmail.com] On Behalf Of
> Jarek
> >>>> Jarcec Cecho
> >>>>> Sent: Thursday, December 11, 2014 1:43 PM
> >>>>> To: dev@sqoop.apache.org
> >>>>> Subject: Re: Hadoop as Compile time dependency in Sqoop2
> >>>>>
> >>>>> Hi Abe,
> >>>>> thank you very much for surfacing the question. I think that there
> >>>> are several twists to it, so my apologies as this will be a long
> >>>> answer :)
> >>>>>
> >>>>> When we started working on Sqoop 2 a few years back, we
> >>>> intentionally pushed the Hadoop dependency as far from the shared
> >>>> libraries as possible. The intention was that no component in common
> >>>> or core should depend on or use any Hadoop APIs, and those should be
> >>>> isolated to separate modules (execution/submission engine). The reason
> >>>> is that Hadoop doesn’t have a particularly good track record of
> >>>> keeping backward compatibility, and that has bitten a lot of projects
> >>>> in the past. For example, every single project that I know of that is
> >>>> using MR needs a shim layer that deals with the API differences (Pig
> >>>> [1], Hive [2], …). The only exception I’m aware of is Sqoop 1, and the
> >>>> only reason we did not have to introduce shims there is that we
> >>>> (shamelessly) copied code from Hadoop into our own code base.
> >>>> Nevertheless, we have places where we had to do that detection anyway
> >>>> [3]. I’m sure that Hadoop is getting better as the project matures,
> >>>> but I would still advise being careful with using various Hadoop APIs
> >>>> and limiting that usage to the extent needed. There will obviously be
> >>>> situations where we want to use a Hadoop API to make our life simpler,
> >>>> such as reusing their security implementation, and that will hopefully
> >>>> be fine.
> >>>>>
> >>>>> Whereas we can be pretty sure that the Sqoop server will have Hadoop
> >>>> libraries on the classpath, and the concern there was more about
> >>>> introducing backward incompatible changes, which is hopefully less
> >>>> important nowadays, not introducing a Hadoop dependency on the client
> >>>> side had a different reason. Hadoop-common is quite a heavy jar with a
> >>>> huge number of dependencies - check out the list in its pom file [4].
> >>>> This is a problem because the Sqoop client is meant to be small and
> >>>> easily reusable, whereas depending on Hadoop will force the
> >>>> application developer onto certain library versions that are dictated
> >>>> by Hadoop (like guava, commons-*). And that forces people to do
> >>>> various weird things, such as using custom class loaders to isolate
> >>>> those libraries from the main application, making the situation in
> >>>> most cases even worse, because the Hadoop libraries assume “ownership”
> >>>> of the underlying JVM and run a lot of eternal threads per class
> >>>> loader. Hence I would advise being doubly careful when introducing a
> >>>> dependency on Hadoop (common) for our client.
> >>>>>
> >>>>> I’m wondering what we’re trying to achieve by moving the dependency
> >>>> from “provided” to “compile”? Do we want to just ensure that it’s
> >>>> always on the server side, or is the intent to get it to the client?
> >>>>>
> >>>>> Jarcec
> >>>>>
> >>>>> Links:
> >>>>> 1: https://github.com/apache/pig/tree/trunk/shims/src
> >>>>> 2: https://github.com/apache/hive/tree/trunk/shims
> >>>>> 3:
> >>>>
> >>
> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
> >>>>> 4:
> >>>>
> >>
> http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar
> >>>>>
> >>>>>> On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <abe@cloudera.com>
> >> wrote:
> >>>>>>
> >>>>>> Hey guys,
> >>>>>>
> >>>>>> With the work being done in Sqoop2 involving authentication, there
> >>>>>> are a few classes that are being used from hadoop auth and
> >>>>>> eventually hadoop common.
> >>>>>>
> >>>>>> I'd like to gauge how folks feel about including the hadoop
> >>>>>> libraries as a "compile" time dependency rather than "provided".
> >>>>>> The reasons being:
> >>>>>>
> >>>>>> 1. Hadoop maintains wire compatibility within a major version:
> >>>>>>
> >>>>
> >>
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
> >>>>>> 2. UserGroupInformation and other useful interfaces are marked as
> >>>>>> "Evolving" or "Unstable":
> >>>>>>
> >>>>
> >>
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
> >>>>>> .
> >>>>>>
> >>>>>> I've been looking around and it seems most projects include Hadoop
> >>>>>> as a compile time dependency:
> >>>>>>
> >>>>>> 1. Kite -
> >>>>>>
> >>>>
> >>
> https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
> >>>>>> 2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
> >>>>>> 3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
> >>>>>> 4. hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
> >>>>>>
> >>>>>> IMO wire compatibility is easier to maintain than Java API
> >>>> compatibility.
> >>>>>> There may be features in future Hadoop releases that we'll want to
> >>>>>> use on the security side as well.
> >>>>>>
> >>>>>> -Abe
> >>>>>
> >>>>> --
> >>>>> You received this message because you are subscribed to the Google
> >>>> Groups "intel-sqoop" group.
> >>>>> To unsubscribe from this group and stop receiving emails from it,
> send
> >>>> an email to intel-sqoop+unsubscribe@cloudera.org.
> >>>>> To post to this group, send email to intel-sqoop@cloudera.org.
> >>>>> To view this discussion on the web visit
> >>>>
> >>
> https://groups.google.com/a/cloudera.org/d/msgid/intel-sqoop/7F91673573F5D241AFCE8EDD6A313D24572C34%40SHSMSX103.ccr.corp.intel.com
> >>>> .
> >>>>
> >>>>
> >>
> >>
>
