sqoop-dev mailing list archives

From Jarek Jarcec Cecho <jar...@apache.org>
Subject Re: Hadoop as Compile time dependency in Sqoop2
Date Thu, 11 Dec 2014 23:50:12 GMT
Got it, so the proposal is really to ship Hadoop libraries as part of our distribution (tarball)
and not let users configure Sqoop using existing ones. I personally don’t feel entirely
comfortable doing so, as I’m afraid that a lot of trouble will pop up along the way (given
my experience), but I’m open to giving it a try. Just to be on the same page, we want to package
Hadoop-common with the server only, right? So I’m assuming that the “compile” dependency
will be on sqoop-core rather than sqoop-common (which is shared between client and server).
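
A minimal sketch of what the "provided" scope implies at runtime, assuming the server wants to check whether a Hadoop 2.6.0-only class is on its classpath before enabling delegation token support. The class name in DELEGATION_TOKEN_CLASS is an assumption used for illustration, and ClasspathProbe and isClassAvailable are hypothetical names, not existing Sqoop code:

```java
public class ClasspathProbe {

    // Assumed name of a hadoop-common class that only exists in newer Hadoop
    // releases; used purely for illustration.
    static final String DELEGATION_TOKEN_CLASS =
        "org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL";

    // Returns true if the named class can be loaded from the current classpath.
    // The class is looked up without running its static initializers.
    public static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar, or a jar whose own dependencies are missing.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean supported = isClassAvailable(DELEGATION_TOKEN_CLASS);
        System.out.println("Delegation token support available: " + supported);
    }
}
```

With a "provided" dependency the result depends on whichever Hadoop jars the user installed; with a "compile" dependency packaged in the tarball, the class would always be present and such a probe would be unnecessary.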


> On Dec 11, 2014, at 3:34 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
> Jarcec,
> I believe that providing delegation support requires using a class on the
> server side that is only available in hadoop-common as of Hadoop 2.6.0 [1].
> This seems like reason enough to change from "provided" to "compile" given
> the feature may not exist in previous versions of Hadoop2.
> Also, requiring that Sqoop2 must be used with Hadoop 2.6.0 or newer doesn't
> seem like a great idea. It delegates hadoop version management to the users
> of Sqoop2, where it might be better to be handled by devs?
> 1. https://issues.apache.org/jira/browse/HADOOP-11083
> On Thu, Dec 11, 2014 at 4:50 PM, Jarek Jarcec Cecho <jarcec@apache.org>
> wrote:
>> Nope, not at all Abe. I also feel that client and server changes should be
>> discussed separately, as there are different reasons and concerns for why
>> or why not to introduce Hadoop dependencies there.
>> For the server side and for the security portion, I feel that we had a good
>> discussion with Richard a while back and I no longer have concerns about
>> using those APIs. I’ll advise caution nevertheless. What are we trying to
>> achieve by changing the scope from “provided” to “compile” here? To the best
>> of my knowledge [1], the only difference is that “provided” means that the
>> dependency is not retrieved and stored in the resulting package and that users
>> have to add it manually after installation. I’m not immediately seeing any
>> impact on the code though.
>> Jarcec
>> Links:
>> 1: http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>> On Dec 11, 2014, at 8:41 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>> Jarcec,
>>> Sorry to butt in... you make a good point on the client side. Would you
>>> mind if we discussed the server side a bit? Re-using the same mechanism on the
>>> server side does require "compile" scope dependencies on Hadoop. Would
>>> that be ok? Are the concerns mainly around the client?
>>> -Abe
>>> On Thu, Dec 11, 2014 at 10:30 AM, Jarek Jarcec Cecho <jarcec@apache.org>
>>> wrote:
>>>> Got it Richard, thank you very much for the nice summary! I’m wondering
>>>> what the use case is for delegation tokens on the client side? Is it to
>>>> support integration with Oozie?
>>>> I do know that Beeline depends on Hadoop common, and that is actually
>>>> a very good example. I’ve seen a sufficient number of users struggling
>>>> with this dependency: using various workarounds for the classpath issue,
>>>> needing to copy over Hadoop configuration files from a real cluster (because
>>>> otherwise a portion of the security didn’t work at all, something with
>>>> auth_to_local rules) and a lot more. That is why I’m advising being
>>>> careful here.
>>>> Jarcec
>>>>> On Dec 11, 2014, at 12:17 AM, Zhou, Richard <richard.zhou@intel.com>
>>>> wrote:
>>>>> Hi Jarcec:
>>>>> Thank you very much for your clarification about the history.
>>>>> The root cause for why we want to change "provided" to "compile" is to
>>>>> implement "Delegation Token Support" [1], review board [2]. The status
>>>>> in Hadoop is shown below.
>>>>> Hadoop 2.5.1 or before: all classes used to implement Kerberos support
>>>>> are in the Hadoop-auth component, which depends on only a few
>>>>> non-Hadoop libraries. It was added to the Sqoop client side (shell
>>>>> component [3]) as "compile", as we agreed before.
>>>>> Hadoop 2.6.0: There was a refactoring to support delegation tokens in
>>>>> Hadoop [4]. Most components in Hadoop, such as RM, Httpfs and Kms, have
>>>>> rewritten their authentication mechanism to use delegation tokens.
>>>>> However, all delegation token related classes are in Hadoop-common
>>>>> instead of Hadoop-auth, because they use the UserGroupInformation class.
>>>>> So if Sqoop needs to support delegation tokens, it has to include the
>>>>> Hadoop-common lib, because I believe that copying code is an
>>>>> unacceptable solution. Even with Hadoop shims, which are a good way to
>>>>> support different versions of Hadoop (I am +1 on writing Hadoop shims in
>>>>> Sqoop like Pig, Hive etc.), Hadoop-common is still a dependency. For
>>>>> example, the client side (Beeline) in Hive depends on the Hadoop-common
>>>>> lib [5]. So I don't think it is a big problem to add Hadoop-common.
>>>>> Additionally, I agree with Abe that wire compatibility is another
>>>>> reason to change "provided" to "compile", since it is in an "Unstable"
>>>>> state. There will be a potential problem in the future.
>>>>> So I prefer to add the Hadoop-common lib as "compile" to make "Delegation
>>>>> Token Support" happen.
>>>>> Add intel-sqoop@cloudera.org.
>>>>> Links:
>>>>> 1: https://issues.apache.org/jira/browse/SQOOP-1776
>>>>> 2: https://reviews.apache.org/r/28795/
>>>>> 3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75
>>>>> 4: https://issues.apache.org/jira/browse/HADOOP-10771
>>>>> 5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133
>>>>> Richard
>>>>> -----Original Message-----
>>>>> From: Jarek Jarcec Cecho [mailto:jarcec@gmail.com] On Behalf Of Jarek
>>>> Jarcec Cecho
>>>>> Sent: Thursday, December 11, 2014 1:43 PM
>>>>> To: dev@sqoop.apache.org
>>>>> Subject: Re: Hadoop as Compile time dependency in Sqoop2
>>>>> Hi Abe,
>>>>> thank you very much for surfacing the question. I think that there are
>>>>> several twists to it, so my apologies as this will be a long answer :)
>>>>> When we started working on Sqoop 2 a few years back, we
>>>>> intentionally pushed the Hadoop dependency as far from the shared
>>>>> libraries as possible. The intention was that no component in common or
>>>>> core should depend on or use any Hadoop APIs, and those should be
>>>>> isolated in separate modules (execution/submission engine). The reason
>>>>> is that Hadoop doesn’t have a particularly good track record of keeping
>>>>> backward compatibility, and that has bitten a lot of projects in the
>>>>> past. For example, every single project that I know of that is using MR
>>>>> needs to have a shim layer that deals with the API differences
>>>>> (Pig [1], Hive [2], …). The only exception that I’m aware of is Sqoop 1,
>>>>> where we did not have to introduce shims only because we (shamelessly)
>>>>> copied code from Hadoop into our own code base. Even so, we have places
>>>>> where we had to do that detection [3]. I’m sure that Hadoop is getting
>>>>> better as the project matures, but I would still advise being careful
>>>>> about using various Hadoop APIs and limiting that usage to the extent
>>>>> needed. There will obviously be situations where we want to use a Hadoop
>>>>> API to make our life simpler, such as reusing their security
>>>>> implementation, and that will hopefully be fine.
>>>>> We can be pretty sure that the Sqoop Server will have Hadoop
>>>>> libraries on the class-path, so the concern there was more about
>>>>> introducing backward incompatible changes, which is hopefully less
>>>>> important nowadays. Not introducing a Hadoop dependency on the client
>>>>> side had a different reason. Hadoop common is quite an important jar
>>>>> that has a huge number of dependencies - check out the list in its pom
>>>>> file [4]. This is a problem because the Sqoop client is meant to be
>>>>> small and easily reusable, whereas depending on Hadoop will force the
>>>>> application developer onto certain library versions that are dictated
>>>>> by Hadoop (like guava, commons-*). And that forces people to do various
>>>>> weird things such as using custom class loaders to isolate those
>>>>> libraries from the main application, making the situation in most cases
>>>>> even worse, because Hadoop libraries assume “ownership” of the
>>>>> underlying JVM and run a lot of eternal threads per class-loader.
>>>>> Hence I would advise being doubly careful when introducing a dependency
>>>>> on Hadoop (common) for our client.
>>>>> I’m wondering what we’re trying to achieve by moving the dependency
>>>>> from “provided” to “compile”? Do we want to just ensure that it’s
>>>>> always on the Server side, or is the intent to get it to the client?
>>>>> Jarcec
>>>>> Links:
>>>>> 1: https://github.com/apache/pig/tree/trunk/shims/src
>>>>> 2: https://github.com/apache/hive/tree/trunk/shims
>>>>> 3: https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
>>>>> 4: http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar
>>>>>> On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <abe@cloudera.com>
>> wrote:
>>>>>> Hey guys,
>>>>>> With the work being done in Sqoop2 involving authentication, there are
>>>>>> a few classes that are being used from hadoop auth and eventually
>>>>>> hadoop common.
>>>>>> I'd like to gauge how folks feel about including the hadoop libraries
>>>>>> as a "compile" time dependency rather than "provided". The reasons
>>>>>> being:
>>>>>> 1. Hadoop maintains wire compatibility within a major version:
>>>>>> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
>>>>>> 2. UserGroupInformation and other useful interfaces are marked as
>>>>>> "Evolving" or "Unstable":
>>>>>> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
>>>>>> I've been looking around and it seems most projects include Hadoop as
>>>>>> a compile time dependency:
>>>>>> 1. Kite - https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
>>>>>> 2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
>>>>>> 3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
>>>>>> 4. hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
>>>>>> IMO wire compatibility is easier to maintain than Java API
>>>>>> compatibility.
>>>>>> There may be features in future Hadoop releases that we'll want to use
>>>>>> on the security side as well.
>>>>>> -Abe
