sqoop-dev mailing list archives

From Abraham Elmahrek <...@cloudera.com>
Subject Re: Hadoop as Compile time dependency in Sqoop2
Date Thu, 11 Dec 2014 16:27:21 GMT
Jarcec,

To further clarify things...

Inspiration:
Since in the Sqoop project we've decided to use Hadoop authentication, a
dependency on Hadoop libraries is an absolute must. Having the same
mechanism for authentication as Hadoop makes sense since Sqoop
traditionally is used to transfer data from a relational database to
Hadoop. This also makes integration with other projects in the Hadoop
ecosystem (like Oozie) easier.

Interface issues:
Some of the more stable interfaces, such as UserGroupInformation, are
marked as "public" and "evolving", which means that they are intended to be
used externally but may change between minor releases. Other APIs are
marked as "public" and "unstable", which means they are intended to be used
externally but may change in any release. Given that even
"UserGroupInformation", a relatively stable API, can change between minor
releases, we have to limit the scope of Hadoop we interface with. I'd prefer
writing shims between Hadoop 1 and Hadoop 2 over writing them for Hadoop 1,
Hadoop 2.0, Hadoop 2.1, Hadoop 2.2, etc. It's much easier to maintain wire
compatibility, which is guaranteed within a major version. Hadoop can still
make a mistake on its side, but those mistakes are fewer and easier to
account for than Java API changes.
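
To illustrate what I mean by a shim, here is a rough sketch. Only
VersionInfo is a real hadoop-common class; the interface and the Sqoop
shim class names are hypothetical, not existing code:

    // Rough sketch only; the Sqoop shim types below are hypothetical.
    // (Each top-level type would live in its own file.)
    public interface HadoopShim {
      // One method per Hadoop API that differs between major versions.
      void loginFromKeytab(String principal, String keytab) throws java.io.IOException;
    }

    public final class HadoopShimLoader {
      // Pick the implementation based on the version reported by hadoop-common.
      public static HadoopShim load() throws Exception {
        String version = org.apache.hadoop.util.VersionInfo.getVersion();  // e.g. "2.6.0"
        String impl = version.startsWith("1.")
            ? "org.apache.sqoop.shims.Hadoop1Shim"   // hypothetical
            : "org.apache.sqoop.shims.Hadoop2Shim";  // hypothetical
        return (HadoopShim) Class.forName(impl).newInstance();
      }
    }

With one shim per major version, only the shim modules need to track
Hadoop's Java API churn.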

Partially provided/Partially compile:
As Richard said, having partially provided and partially compile
dependencies doesn't make a lot of sense, since there will likely be
compatibility errors when bridging minor versions of Hadoop.

To conclude, it's difficult to have Hadoop authentication in Sqoop without
the libraries that provide Hadoop auth, and it's difficult to use the Hadoop
auth libraries without creating a compile-time dependency.
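
To make that concrete, here is a minimal sketch of the kind of call the
server side ends up making. Configuration and UserGroupInformation are the
real hadoop-common classes; the wrapper class, principal, and keytab path
are just placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Switch UGI from simple auth to Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Principal and keytab path are placeholders.
        UserGroupInformation.loginUserFromKeytab(
            "sqoop/host@EXAMPLE.COM", "/etc/security/keytabs/sqoop.keytab");
        System.out.println("Logged in as " + UserGroupInformation.getLoginUser());
      }
    }

Any class that compiles against these calls needs hadoop-common on the
compile classpath, which is exactly the dependency in question.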

Some of the downsides of this are that testing against new releases of
Hadoop may be necessary, given that "things can happen", and that upgrading
Hadoop dependencies in Sqoop will be a bit more tedious. I see these as
acceptable trade-offs for the improved functionality, though.

-Abe


On Thu, Dec 11, 2014 at 2:17 AM, Zhou, Richard <richard.zhou@intel.com>
wrote:

> Hi Jarcec:
> Thank you very much for your clarification about the history.
>
> The root cause of why we want to change "provided" to "compile" is to
> implement "Delegation Token Support" [1], review board [2]. The status in
> Hadoop is shown below.
> Hadoop 2.5.1 and earlier: all classes used to implement Kerberos support
> are in the hadoop-auth component, which depends on only a few non-Hadoop
> libs. It is added on the Sqoop client side (shell component [3]) as
> "compile", as we agreed before.
> Hadoop 2.6.0: there was a refactoring to support delegation tokens in
> Hadoop [4]. Most components in Hadoop, such as the RM, HttpFS, and KMS,
> have rewritten their authentication mechanisms to use delegation tokens.
> However, all the delegation-token-related classes are in hadoop-common
> instead of hadoop-auth, because they use the UserGroupInformation class.
>
> So if Sqoop needs to support delegation tokens, it has to include the
> hadoop-common lib, because I believe that copying code is an unacceptable
> solution. Even with Hadoop shims, which are a good solution for supporting
> different versions of Hadoop (I am +1 on writing a Hadoop shim layer in
> Sqoop like Pig, Hive, etc.), hadoop-common is still a dependency. For
> example, the Hive client side (Beeline) depends on the hadoop-common lib
> [5]. So I don't think it is a big problem to add hadoop-common in.
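>
> As a rough illustration (not part of the patch in [2]): if I read [4]
> right, DelegationTokenAuthenticatedURL lives in hadoop-common's
> org.apache.hadoop.security.token.delegation.web package, and a client
> call would look roughly like the sketch below. The surrounding class and
> the endpoint URL are made up:
>
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>
>     import org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL;
>
>     public class DelegationTokenClientSketch {
>       public static void main(String[] args) throws Exception {
>         // Token holder, reused across requests once it is populated.
>         DelegationTokenAuthenticatedURL.Token token =
>             new DelegationTokenAuthenticatedURL.Token();
>         DelegationTokenAuthenticatedURL authUrl = new DelegationTokenAuthenticatedURL();
>
>         // Placeholder endpoint; a real client would point at the Sqoop server.
>         URL url = new URL("http://sqoop-server.example.com:12000/sqoop/v1/version");
>         HttpURLConnection conn = authUrl.openConnection(url, token);
>         System.out.println("HTTP " + conn.getResponseCode());
>       }
>     }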
>
> Additionally, I agree with Abe that wire compatibility is another reason
> to change "provided" to "compile", since the Java API is in an "Unstable"
> state and could become a problem in the future.
>
> So I prefer to add the hadoop-common lib as "compile" to make "Delegation
> Token Support" happen.
>
> Adding intel-sqoop@cloudera.org.
>
> Links:
> 1: https://issues.apache.org/jira/browse/SQOOP-1776
> 2: https://reviews.apache.org/r/28795/
> 3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75
> 4: https://issues.apache.org/jira/browse/HADOOP-10771
> 5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133
>
> Richard
>
> -----Original Message-----
> From: Jarek Jarcec Cecho [mailto:jarcec@gmail.com] On Behalf Of Jarek
> Jarcec Cecho
> Sent: Thursday, December 11, 2014 1:43 PM
> To: dev@sqoop.apache.org
> Subject: Re: Hadoop as Compile time dependency in Sqoop2
>
> Hi Abe,
> thank you very much for surfacing the question. I think that there are
> several twists to it, so my apologies as this will be a long answer :)
>
> When we started working on Sqoop 2 a few years back, we intentionally
> pushed the Hadoop dependency as far from the shared libraries as possible.
> The intention was that no component in common or core should depend on or
> use any Hadoop APIs; those should be isolated in separate modules
> (execution/submission engine). The reason for that is that Hadoop doesn't
> have a particularly good track record of keeping backward compatibility,
> and it has bitten a lot of projects in the past. For example, every single
> project that I know of that is using MR needs a shim layer that deals with
> the API differences (Pig [1], Hive [2], ...). The only exception to this
> that I'm aware of is Sqoop 1, and the only reason we did not have to
> introduce shims there is that we (shamelessly) copied code from Hadoop into
> our own code base. Nevertheless, we still have places where we had to do
> that version detection [3]. I'm sure that Hadoop is getting better as the
> project matures, but I would still advise being careful about using various
> Hadoop APIs and limiting that usage to the extent needed. There will
> obviously be situations where we want to use Hadoop APIs to make our life
> simpler, such as reusing their security implementation, and that will
> hopefully be fine.
>
> Whereas we can be pretty sure that the Sqoop Server will have Hadoop
> libraries on the class-path, and the concern there was more about
> introducing backward-incompatible changes, which is hopefully less
> important nowadays, not introducing a Hadoop dependency on the client side
> had a different reason. Hadoop common is quite an important jar that has a
> huge number of dependencies - check out the list in its pom file [4]. This
> is a problem because the Sqoop client is meant to be small and easily
> reusable, whereas depending on Hadoop will force the application developer
> into certain library versions that are dictated by Hadoop (like guava,
> commons-*). And that forces people to do various weird things such as using
> custom class loaders to isolate those libraries from the main application,
> making the situation in most cases even worse, because Hadoop libraries
> assume "ownership" of the underlying JVM and run a lot of long-lived
> threads per class-loader. Hence I would advise being doubly careful when
> introducing a dependency on Hadoop (common) for our client.
>
> I'm wondering what we're trying to achieve by moving the dependency from
> "provided" to "compile". Do we want to just ensure that it's always on the
> server side, or is the intent to get it to the client?
>
> Jarcec
>
> Links:
> 1: https://github.com/apache/pig/tree/trunk/shims/src
> 2: https://github.com/apache/hive/tree/trunk/shims
> 3:
> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
> 4:
> http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar
>
> > On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
> >
> > Hey guys,
> >
> > With the work being done in Sqoop2 involving authentication, there are
> > a few classes that are being used from hadoop-auth and, eventually,
> > hadoop-common.
> >
> > I'd like to gauge how folks feel about including the hadoop libraries
> > as a "compile" time dependency rather than "provided". The reasons being:
> >
> >   1. Hadoop maintains wire compatibility within a major version:
> >
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
> >   2. UserGroupInformation and other useful interfaces are marked as
> >   "Evolving" or "Unstable":
> >
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
> >   .
> >
> > I've been looking around and it seems most projects include Hadoop as
> > a compile-time dependency:
> >
> >   1. Kite -
> >
> https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
> >   2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
> >   3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
> >   4. hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
> >
> > IMO wire compatibility is easier to maintain than Java API compatibility.
> > There may be features in future Hadoop releases that we'll want to use
> > on the security side as well.
> >
> > -Abe
>
>
