hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11656) Classpath isolation for downstream clients
Date Mon, 02 Mar 2015 18:54:06 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343565#comment-14343565 ]

Colin Patrick McCabe commented on HADOOP-11656:

Thank you for filing this, [~busbey].  +1000 for fixing this... it is a huge pain point in
Hadoop deployments.

bq. Steve wrote: There's another strategy, which is pure-REST-client. Why do we need an HDFS
client using Hadoop IPC when we have webHDFS? Same for YARN? Even a YARN app shouldn't need
to pull in yarn-*.jar, though there's enough IPC there & other things you probably would
have to.

A pure REST client is slower than a pure Java client, and can't do things like zero-copy reads,
short-circuit reads, and so forth.  Another way of realizing this is to see that httpfs and
webhdfs have been around for a long time, and haven't solved this problem for our users.

bq. the other strategy "ultra-lean client" is appealing, though we're fairly contaminated
with Guava, commons-logging, SLF4J, httpclient, commons-lang, etc. The notion of "single client
JAR" is going to be hard to pull off without embracing Shading, and the wrongness that comes
from that.

Guava is a really nice library.  It's nice on the server, and it's just as nice on the client.
 We had this discussion earlier when someone attempted to remove Guava from the client...
"that dog won't hunt."  And even if it did, we have Jackson, Protobuf, the AWS SDK, ZooKeeper,
Jersey, GlassFish, Avro, Jetty, and on and on.

We *can't* solve this problem by minimizing dependencies.  Even if we do a huge amount
of code-worsening wheel-reinvention to get rid of our nice utility libraries, we still are
stuck with dependencies like Protobuf and Jetty.  The Protobuf 2.4.1 -> 2.5.0 transition
caused a huge amount of pain for users and developers.  And we all know about the security
implications of using old libraries.  In a larger sense, good software architecture should
involve code reuse and libraries when appropriate.  Treating dependencies as "contamination"
will just result in more "not invented here" syndrome.  It doesn't scale.

bq. ps, we don't really make dependency promises. If you look at the Hadoop compatibility
document, you can see we explicitly say "no guarantees". That's not an accident. We're just
being somewhat cautious about updating things. If, say, HBase, accumulo & Oozie all wanted
a co-ordinated update, we could try.

"Not making dependency promises" is just kicking the problem out to our users.  It makes people
unwilling to upgrade because they don't know if their code will be broken by the removal or
alteration of a jar they need.  Case in point: Jackson 1.8.8 -> 1.9 broke a lot of user
code because it removed {{defaultPrettyPrintingWriter}} and replaced it with a function called
{{writerWithDefaultPrettyPrinter}}.  This is why some enterprise distros didn't pick up the
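
To illustrate the kind of workaround a rename like this pushes onto downstream code, here is a minimal sketch of a reflection shim that tries the newer method name first and falls back to the older one.  {{OldMapper}}, {{NewMapper}}, and {{invokeFirstAvailable}} are hypothetical stand-ins written for this example; they are not Jackson classes, and only the two method names are taken from the comment above.

```java
import java.lang.reflect.Method;

public class RenamedMethodShim {
    /** Hypothetical stand-in for a library class before the rename. */
    public static class OldMapper {
        public String defaultPrettyPrintingWriter() { return "old"; }
    }

    /** Hypothetical stand-in for the same class after the rename. */
    public static class NewMapper {
        public String writerWithDefaultPrettyPrinter() { return "new"; }
    }

    /**
     * Invokes the first public no-argument method on target whose name
     * matches one of the candidates, in the order given.
     */
    public static Object invokeFirstAvailable(Object target, String... names) {
        for (String name : names) {
            try {
                Method m = target.getClass().getMethod(name);
                return m.invoke(target);
            } catch (NoSuchMethodException e) {
                // This version of the class lacks the method; try the next name.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("call to " + name + " failed", e);
            }
        }
        throw new IllegalStateException("none of: " + String.join(", ", names));
    }
}
```

A caller that has to run against either version would pass both names, newest first.  Every downstream project ends up carrying something like this for each renamed method it touches, which is the cost being described.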

We have tried dependency harmonization in the past.  It doesn't work, because different projects
have different release schedules and different needs.  Not to mention different communities.
 Also, projects like HBase want to support multiple versions of Hadoop.  This means that they
either have to live with mixed versions of things like Guava, Jetty, etc. or agree to never
update dependencies.

bq. Do you propose writing your own classloader? If so, we're in trouble —based on my experience
with every single classloader I have encountered. The consensus has gathered around OSGi not
because it is any better than other people's attempts, it is simply no worse, and with "a
standard", you the individual don't take the hit for: security problems, .class leakage, object
equality breakage, classloader leakage, etc etc. Simple example, UGI relies on being a singleton
for its identity management. Embrace classloaders and you have >1 UGI singleton, so had
better be confident that their doAs identities worked as required.

Hadoop is a big project and worth the effort to manage our own CLASSPATH.  If there are problems
we can work through them.  I am not opposed to OSGi but I think that is a separate discussion.
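
For readers unfamiliar with the singleton concern quoted above, the following is a minimal sketch of how it arises, not a proposal for how Hadoop would implement isolation.  {{Identity}} is a hypothetical stand-in for a class that relies on being a singleton (it is not UGI), and {{IsolatingLoader}} is a hypothetical child-first loader that defines one named class itself instead of delegating to its parent.  The sketch assumes the class file is visible as a resource on the parent loader.

```java
import java.io.IOException;
import java.io.InputStream;

public class IsolationDemo {
    /** Hypothetical stand-in for a class that assumes one instance per process. */
    public static class Identity {
        public static final Object INSTANCE = new Object();
    }

    /** Hypothetical child-first loader for exactly one class name. */
    static class IsolatingLoader extends ClassLoader {
        private final String isolated;

        IsolatingLoader(ClassLoader parent, String isolated) {
            super(parent);
            this.isolated = isolated;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (!name.equals(isolated)) {
                return super.loadClass(name, resolve); // normal parent-first path
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        // Reads the class file through the parent, assuming it is on its classpath.
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                return in.readAllBytes();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads Identity through two isolating loaders and compares the results. */
    public static boolean singletonsDiffer() {
        try {
            String name = Identity.class.getName();
            ClassLoader parent = IsolationDemo.class.getClassLoader();
            Class<?> a = new IsolatingLoader(parent, name).loadClass(name);
            Class<?> b = new IsolatingLoader(parent, name).loadClass(name);
            Object ia = a.getField("INSTANCE").get(null);
            Object ib = b.getField("INSTANCE").get(null);
            // Distinct Class objects, each with its own static state.
            return a != b && ia != ib;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Each loader that defines the class gets its own copy of the static field, so any isolation scheme has to decide which classes are shared with the parent and which are loaded per client.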

> Classpath isolation for downstream clients
> ------------------------------------------
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies
> Currently, Hadoop exposes downstream clients to a variety of third party libraries. As
our code base grows and matures we increase the set of libraries we rely on. At the same time,
as our user base grows we increase the likelihood that some downstream project will run into
a conflict while attempting to use a different version of some library we depend on. This
has already happened with e.g. Guava several times for HBase, Accumulo, and Spark (and I'm
sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to off and
they don't do anything to help dependency conflicts on the driver side or for folks talking
to HDFS directly. This should serve as an umbrella for changes needed to do things thoroughly
on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that doesn't
pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when executing user
provided code, whether client side in a launcher/driver or on the cluster in a container or
within MR.
> This provides us with a double benefit: users get less grief when they want to run substantially
ahead or behind the versions we need and the project is freer to change our own dependency
versions because they'll no longer be in our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases written in
the comments.

This message was sent by Atlassian JIRA
