sqoop-dev mailing list archives

From "Zhou, Richard" <richard.z...@intel.com>
Subject RE: Hadoop as Compile time dependency in Sqoop2
Date Fri, 12 Dec 2014 03:13:00 GMT
Hi Jarcec & Abe:
Thank you for the clear explanation. I have several thoughts on it.

1.       Since the Hadoop dependency is “provided” in the Sqoop server, and the real classpath
is set in catalina.properties, how do we avoid compatibility mistakes? Say the Sqoop server
is built successfully against Hadoop 2.5.1, as shown in the root pom.xml, while the real cluster
runs Hadoop 2.5.0. Even though there are only minor changes between these two releases,
unexpected exceptions could still occur.

2.       If “compile” is used on both the client and the server side, wire compatibility is
ensured for the authentication communication between client and server. But another compatibility
problem surfaces: the Sqoop server then depends on two different versions of Hadoop-common,
2.6.0 from “compile” and 2.5.0 (the real cluster version) from the classpath in catalina.properties.

3.       As Abe said, if we use “compile” on the client side and “provided” on the
server side, I agree that there would be some wire compatibility issues.

4.       The best way to resolve all compatibility issues is to use “provided” on both
the client and the server side, with the Hadoop-common lib taken from the real cluster's classpath,
which must be 2.6.0 or later. However, it is impossible to make all users use Hadoop 2.6.0 or later.
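One way to soften the version mismatch described in the points above is to probe the runtime classpath for the feature before relying on it. The following is only a minimal sketch under stated assumptions, not what Sqoop does: `HadoopFeatureProbe`, `chooseAuthMode`, and the mode labels are hypothetical names, and the delegation token class name is an assumption that should be checked against the Hadoop release in use.

```java
/**
 * Illustrative sketch only: checks by reflection whether a class is on the
 * runtime classpath, so code built against a newer Hadoop can fall back
 * when the cluster provides an older hadoop-common.
 */
public final class HadoopFeatureProbe {

    // Assumption: a delegation token class expected in hadoop-common 2.6.0+.
    // Verify this name against the Hadoop version you actually target.
    public static final String DELEGATION_TOKEN_CLASS =
        "org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter";

    private HadoopFeatureProbe() {
    }

    /** Returns true if the named class can be located without initializing it. */
    public static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, HadoopFeatureProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or present but incompatible with what is loaded.
            return false;
        }
    }

    /** Hypothetical mode labels: delegation tokens when available, doAs otherwise. */
    public static String chooseAuthMode(boolean delegationTokenAvailable) {
        return delegationTokenAvailable ? "delegation-token" : "doAs";
    }

    public static void main(String[] args) {
        boolean available = isClassAvailable(DELEGATION_TOKEN_CLASS);
        System.out.println("Selected mode: " + chooseAuthMode(available));
    }
}
```

A probe like this only detects whether a class is present. It does not detect signature or behavior changes between releases, so it would complement a shim layer rather than replace one.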

So I am wondering whether it is a little too early to support delegation tokens now, since
they are a new feature in Hadoop 2.6.0 while Sqoop only supports 2.5.1. Even though Hadoop 2.6.0
will be supported sooner or later, Sqoop must also keep supporting Hadoop 2.5.1 and earlier
for a long time. Maybe we should re-open delegation token support in the Hadoop 3.* timeframe,
as delegation tokens should be supported by then. As for the Kerberos support
task (SQOOP-1525), it could be finished once the doAs function is completed. Actually this code is
ready; the reason I have not uploaded it for review is that I thought delegation token was a
better solution, and there seemed no need to commit the doAs code and then rewrite it with
delegation token. As for delegation token support, it could be put into improvement of Kerberos


From: Abraham Elmahrek [mailto:abe@cloudera.com]
Sent: Friday, December 12, 2014 8:14 AM
To: dev@sqoop.apache.org
Cc: Zhou, Richard
Subject: Re: Hadoop as Compile time dependency in Sqoop2

I'll have to do a bit of experimentation to better understand packaging and dependencies.
If we conditionally make hadoop-common a compile-time dependency in sqoop-core, would that
affect the classpath of the other components in the server? And would it still be marked as
provided in the dependencyManagement section of the root pom?


On Thu, Dec 11, 2014 at 5:50 PM, Jarek Jarcec Cecho <jarcec@apache.org> wrote:
Got it, so the proposal is really to ship Hadoop libraries as part of our distribution (tarball)
and not let users configure Sqoop to use existing ones. I personally don’t feel entirely
comfortable doing so, as I’m afraid that a lot of trouble will pop up on the way (given
my experience), but I’m open to giving it a try. Just to be on the same page, we want to package
Hadoop-common with the server only, right? So I’m assuming that the “compile” dependency
will be on sqoop-core rather than sqoop-common (which is shared between client and server).


> On Dec 11, 2014, at 3:34 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
> Jarcec,
> I believe that providing delegation support requires using a class on the
> server side that is only available in hadoop-common as of Hadoop 2.6.0 [1].
> This seems like reason enough to change from "provided" to "compile" given
> the feature may not exist in previous versions of Hadoop2.
> Also, requiring that Sqoop2 must be used with Hadoop 2.6.0 or newer doesn't
> seem like a great idea. It delegates hadoop version management to the users
> of Sqoop2, where it might be better to be handled by devs?
> 1. https://issues.apache.org/jira/browse/HADOOP-11083
> On Thu, Dec 11, 2014 at 4:50 PM, Jarek Jarcec Cecho <jarcec@apache.org> wrote:
>> Nope, not at all Abe. I also feel that client and server changes should be
>> discussed separately, as there are different reasons/concerns for why or why
>> not to introduce Hadoop dependencies there.
>> For the server side and for the security portion, I feel that we had a good
>> discussion with Richard a while back and I no longer have concerns about
>> using those APIs. I’ll advise caution nevertheless. What are we trying to
>> achieve by changing the scope from “provided” to “compile” here? To the best of my
>> knowledge [1], the only difference is that “provided” means that the
>> dependency is not retrieved and stored in the resulting package and that users
>> have to add it manually after installation. I’m not immediately seeing any
>> impact on the code though.
>> Jarcec
>> Links:
>> 1:
>> http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>> On Dec 11, 2014, at 8:41 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>> Jarcec,
>>> Sorry to butt in... you make a good point on the client side. Would you mind
>>> if we discussed the server side a bit? Re-using the same mechanism on the
>>> server side does require "compile" scope dependencies on Hadoop. Would that
>>> be ok? Are the concerns mainly around the client?
>>> -Abe
>>> On Thu, Dec 11, 2014 at 10:30 AM, Jarek Jarcec Cecho <jarcec@apache.org> wrote:
>>>> Got it Richard, thank you very much for the nice summary! I’m wondering
>>>> what the use case for delegation tokens on the client side is. Is it to support
>>>> integration with Oozie?
>>>> I do know that Beeline depends on Hadoop common, and that is actually
>>>> a very good example. I’ve seen a sufficient number of users struggling with
>>>> this dependency - using various workarounds for the classpath issue, having
>>>> to copy over Hadoop configuration files from the real cluster (because
>>>> otherwise a portion of the security didn’t work at all, something with
>>>> auth_to_local rules) and a lot more. That is why I’m advising being
>>>> careful here.
>>>> Jarcec
>>>>> On Dec 11, 2014, at 12:17 AM, Zhou, Richard <richard.zhou@intel.com> wrote:
>>>>> Hi Jarcec:
>>>>> Thank you very much for your clarification about the history.
>>>>> The root cause for why we want to change "provided" to "compile" is to
>>>>> implement "Delegation Token Support" [1], review board [2]. The status in
>>>>> Hadoop is shown below.
>>>>> Hadoop 2.5.1 or before: all classes used to implement Kerberos support
>>>>> are in the Hadoop-auth component, which depends on only a few
>>>>> non-Hadoop libs. It was added on the Sqoop client side (shell
>>>>> component [3]) as "compile", as we agreed before.
>>>>> Hadoop 2.6.0: There is a refactoring to support delegation tokens in Hadoop
>>>>> [4]. Most components in Hadoop, such as RM, HttpFS and KMS, have rewritten
>>>>> their authentication mechanism to use delegation tokens. However, all
>>>>> delegation token related classes are in Hadoop-common instead of Hadoop-auth,
>>>>> because they use the UserGroupInformation class.
>>>>> So if Sqoop needs to support delegation tokens, it has to include the
>>>>> Hadoop-common lib, because I believe that copying code is an unacceptable
>>>>> solution. Even with Hadoop shims, which are a good way to support
>>>>> different versions of Hadoop (I am +1 on writing Hadoop shims in Sqoop
>>>>> like Pig, Hive etc.), Hadoop-common is still a dependency. For example,
>>>>> the client side (Beeline) in Hive depends on the Hadoop-common lib [5]. So I
>>>>> don't think it is a big problem to add Hadoop-common.
>>>>> Additionally, I agree with Abe that wire compatibility is another reason
>>>>> to change "provided" to "compile", since the interfaces involved are marked
>>>>> "Unstable", which could become a problem in the future.
>>>>> So I prefer to add the Hadoop-common lib as "compile" to make "Delegation
>>>>> Token Support" happen.
>>>>> Add intel-sqoop@cloudera.org.
>>>>> Links:
>>>>> 1: https://issues.apache.org/jira/browse/SQOOP-1776
>>>>> 2: https://reviews.apache.org/r/28795/
>>>>> 3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75
>>>>> 4: https://issues.apache.org/jira/browse/HADOOP-10771
>>>>> 5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133
>>>>> Richard
>>>>> -----Original Message-----
>>>>> From: Jarek Jarcec Cecho [mailto:jarcec@gmail.com] On Behalf Of Jarek Jarcec Cecho
>>>>> Sent: Thursday, December 11, 2014 1:43 PM
>>>>> To: dev@sqoop.apache.org
>>>>> Subject: Re: Hadoop as Compile time dependency in Sqoop2
>>>>> Hi Abe,
>>>>> thank you very much for surfacing the question. I think that there are
>>>>> several twists to it, so my apologies as this will be a long answer :)
>>>>> When we started working on Sqoop 2 a few years back, we
>>>>> intentionally pushed the Hadoop dependency as far from shared libraries as
>>>>> possible. The intention was that no component in common or core should
>>>>> depend on or use any Hadoop APIs, and that those should be isolated to separate
>>>>> modules (execution/submission engine). The reason is that Hadoop
>>>>> doesn’t have a particularly good track record of keeping backward compatibility,
>>>>> and that has bitten a lot of projects in the past. For example, every single
>>>>> project that I know of that uses MR needs a shim layer that
>>>>> deals with the API differences (Pig [1], Hive [2], …). The only
>>>>> exception that I’m aware of is Sqoop 1, where we did not have to
>>>>> introduce shims only because we (shamelessly) copied code from Hadoop into
>>>>> our own code base. Even so, there are places where we had to do that
>>>>> detection anyway [3]. I’m sure that Hadoop is getting better as the
>>>>> project matures, but I would still advise being careful about using various
>>>>> Hadoop APIs and limiting that usage to the extent needed. There will
>>>>> obviously be situations where we want to use a Hadoop API to make our life
>>>>> simpler, such as reusing their security implementation, and that will
>>>>> hopefully be fine.
>>>>> We can be pretty sure that the Sqoop server will have Hadoop
>>>>> libraries on the class-path, and the concern there was more about
>>>>> backward-incompatible changes, which is hopefully less important
>>>>> nowadays. Not introducing a Hadoop dependency on the client side had a different
>>>>> reason. Hadoop common is quite a significant jar that has a huge number of
>>>>> dependencies - check out the list in its pom file [4]. This is a problem
>>>>> because the Sqoop client is meant to be small and easily reusable, whereas
>>>>> depending on Hadoop will force the application developer onto certain library
>>>>> versions that are dictated by Hadoop (like guava, commons-*). And that
>>>>> forces people to do various weird things, such as using custom class loaders
>>>>> to isolate those libraries from the main application, making the situation
>>>>> in most cases even worse, because Hadoop libraries assume “ownership” of
>>>>> the underlying JVM and run a lot of eternal threads per class-loader.
>>>>> Hence I would advise being doubly careful when introducing a dependency on
>>>>> Hadoop (common) for our client.
>>>>> I’m wondering what we’re trying to achieve by moving the dependency from
>>>>> “provided” to “compile”? Do we want to just ensure that it’s always on the
>>>>> Server side, or is the intent to get it to the client?
>>>>> Jarcec
>>>>> Links:
>>>>> 1: https://github.com/apache/pig/tree/trunk/shims/src
>>>>> 2: https://github.com/apache/hive/tree/trunk/shims
>>>>> 3: https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
>>>>> 4: http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar
>>>>>> On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>>>>> Hey guys,
>>>>>> With the work being done in Sqoop2 involving authentication, there are
>>>>>> a few classes that are being used from hadoop auth and eventually
>>>>>> hadoop common.
>>>>>> I'd like to gauge how folks feel about including the hadoop libraries
>>>>>> as a "compile" time dependency rather than "provided". The reasons being:
>>>>>> 1. Hadoop maintains wire compatibility within a major version:
>>>>>> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
>>>>>> 2. UserGroupInformation and other useful interfaces are marked as
>>>>>> "Evolving" or "Unstable":
>>>>>> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
>>>>>> I've been looking around and it seems most projects include Hadoop as
>>>>>> a compile time dependency:
>>>>>> 1. Kite - https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
>>>>>> 2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
>>>>>> 3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
>>>>>> 4. Hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
>>>>>> IMO wire compatibility is easier to maintain than Java API compatibility.
>>>>>> There may be features in future Hadoop releases that we'll want to use
>>>>>> on the security side as well.
>>>>>> -Abe