hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11656) Classpath isolation for downstream clients
Date Mon, 02 Mar 2015 03:18:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342656#comment-14342656 ]

Steve Loughran commented on HADOOP-11656:

I see the need, appreciate the idea, and know how much downstream projects will benefit.

But... "we are allowed to break things in the 3.x line" is not the same as "we should break
things in the 3.x line". I need to understand what the plan is here, especially as "little"
things like HADOOP-11293 show that what is considered private is in fact used in things like
YARN apps downstream, as are HttpServer2 and AmIPFilter.

Do you propose writing your own classloader? If so, we're in trouble, based on my experience
with *every single classloader I have encountered*. The consensus has gathered around OSGi
not because it is any better than other people's attempts, but because it is simply no worse,
and with "a standard", you the individual don't take the hit for: security problems, .class
leakage, object equality breakage, classloader leakage, etc. Simple example: UGI relies on being
a singleton for its identity management. Embrace classloaders and you have >1 UGI singleton,
so you had better be confident that their doAs identities work as required.
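
To make the singleton concern concrete, here is a minimal sketch in plain Java (9 or later), not Hadoop code. {{Identity}} is a hypothetical stand-in for a class that keeps its identity in static state, and {{IsolatingLoader}} is an illustrative loader invented for this example that defines its own copy of that class rather than delegating to its parent.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class SingletonPerLoader {

    /** Hypothetical stand-in for a class that keeps its identity in static state. */
    public static class Identity {
        public static final Object INSTANCE = new Object();
    }

    /** Illustrative loader: defines its own copy of Identity, delegates everything else. */
    static final class IsolatingLoader extends ClassLoader {
        IsolatingLoader() {
            super(SingletonPerLoader.class.getClassLoader());
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(Identity.class.getName())) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    c = defineFromParentBytes(name);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the .class bytes through the parent and defines them in this loader. */
        private Class<?> defineFromParentBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                byte[] bytes = in.readAllBytes();
                return defineClass(name, bytes, 0, bytes.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Counts distinct "singleton" objects seen via the app loader and two isolating loaders. */
    static int distinctSingletons() throws Exception {
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.add(Identity.INSTANCE);
        for (int i = 0; i < 2; i++) {
            Class<?> copy = Class.forName(Identity.class.getName(), true, new IsolatingLoader());
            seen.add(copy.getField("INSTANCE").get(null));
        }
        return seen.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("distinct Identity.INSTANCE objects: " + distinctSingletons());
    }
}
```

On a standard JVM each loader should end up with its own {{Identity}} class and therefore its own {{INSTANCE}}, which is the failure mode to worry about for anything that assumes one identity holder per process.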

The other strategy, "ultra-lean client", is appealing, though we're fairly contaminated with
Guava, commons-logging, SLF4J, httpclient, commons-lang, etc. The notion of a "single client
JAR" is going to be hard to pull off without embracing shading, and the wrongness that comes
from that.

There's another strategy, which is pure-REST-client. Why do we need an HDFS client using Hadoop
IPC when we have webHDFS? Same for YARN? Even a YARN app shouldn't need to pull in yarn-*.jar,
though there's enough IPC and other things there that you probably would have to.
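
As a rough illustration of what "pure REST" could look like, this sketch lists a directory through the webHDFS {{LISTSTATUS}} operation using only JDK classes (Java 9 or later), so nothing from the Hadoop dependency tree lands on the client classpath. The class name, host, port and path are placeholders, and authentication and error handling are left out.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class WebHdfsSketch {

    /** Builds http://host:port/webhdfs/v1<path>?op=LISTSTATUS for an absolute HDFS path. */
    static URI listStatusUri(String host, int port, String path) throws Exception {
        // The multi-argument URI constructor quotes illegal characters in the path.
        return new URI("http", null, host, port, "/webhdfs/v1" + path, "op=LISTSTATUS", null);
    }

    /** Issues the GET and returns the raw JSON response body as a string. */
    static String listStatus(String host, int port, String path) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) listStatusUri(host, port, path).toURL().openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; substitute a real NameNode HTTP address before calling listStatus.
        System.out.println(listStatusUri("namenode.example.org", 50070, "/user/alice"));
    }
}
```

The trade-off is that the caller parses JSON and handles redirects and security itself, which is exactly the work the IPC client libraries currently hide.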

I know HDFS-6200 covers trying to have a specific client JAR for HDFS, and HADOOP-1815 one for
hadoop itself; I was talking with Haohui only last week on this topic, along with the specific
topic of Jetty. (Actually I was proposing a facebook "down with Guava" page but that won't
solve the problem at hand)

bq. Updating your dependencies is a straight forward task.

Only if you can determine them at compile time. Even the change where we moved s3n:// support
out of hadoop-common and into hadoop-aws, including its transitive dependencies, was risky
enough, as it meant that anything assuming the s3n:// implementation was in hadoop-common &
dependencies would break, and that breakage happens at runtime, not at compile time.
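
A small sketch of why the compiler cannot catch this kind of breakage: when an implementation is resolved by class name at runtime, the calling code compiles whether or not the JAR holding that class is on the classpath. {{RuntimeLookupSketch}} and {{isLoadable}} are names made up for this example; the s3n class name is the implementation class that moved to hadoop-aws.

```java
final class RuntimeLookupSketch {

    // Implementation class behind s3n://, located in hadoop-aws after the move.
    static final String S3N_IMPL = "org.apache.hadoop.fs.s3native.NativeS3FileSystem";

    /** Returns true if the named class can be loaded from this class's classpath. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initialisers.
            Class.forName(className, false, RuntimeLookupSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies are missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Compiles with or without hadoop-aws; the answer is only known when it runs.
        System.out.println(S3N_IMPL + " loadable: " + isLoadable(S3N_IMPL));
    }
}
```

A downstream project could run a check like this in its own startup or test code to turn a late, obscure failure into an early, explicit one.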

Some POM-only modules, e.g. "hadoop-client-complete", are one tactic; another, more subtle, is
to keep hadoop-common as is, but add some thinner ones you can pull in, e.g. "hadoop-lean-client".

Returning to the matter in hand, you call out Guava. Is it the case that Guava is the specific
pain-point? Because we all hate being stuck on Guava 11, but have not upgraded because of
downstream apps that would get broken if we moved off it. It may be that we can work together
across the ASF projects & see how we can do a co-ordinated Guava update (hopefully
with less pain than the 2013 protobuf update) and come up with a common strategy for dealing
with Guava and other Google libraries whose backwards compatibility story isn't great.

Otherwise, I'm more in favour of lean clients with minimal dependencies (especially not Guava),
with classpath isolation through OSGi as an option. There's been a lot of historical work there,
which could be restarted.

PS: we don't really make dependency promises. If you look at the [Hadoop compatibility document|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#Java_Classpath],
you can see we explicitly say "no guarantees". That's not an accident. We're just being somewhat
cautious about updating things. If, say, HBase, Accumulo & Oozie all wanted a co-ordinated
update, we could try.

> Classpath isolation for downstream clients
> ------------------------------------------
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies
> Currently, Hadoop exposes downstream clients to a variety of third party libraries. As
> our code base grows and matures we increase the set of libraries we rely on. At the same time,
> as our user base grows we increase the likelihood that some downstream project will run into
> a conflict while attempting to use a different version of some library we depend on. This
> has already happened with i.e. Guava several times for HBase, Accumulo, and Spark (and I'm
> sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to off and
> they don't do anything to help dependency conflicts on the driver side or for folks talking
> to HDFS directly. This should serve as an umbrella for changes needed to do things thoroughly
> on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that doesn't
> pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when executing user
> provided code, whether client side in a launcher/driver or on the cluster in a container or
> within MR.
> This provides us with a double benefit: users get less grief when they want to run substantially
> ahead or behind the versions we need and the project is freer to change our own dependency
> versions because they'll no longer be in our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases written in
> the comments.

This message was sent by Atlassian JIRA
