hadoop-common-issues mailing list archives

From "Sean Busbey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11656) Classpath isolation for downstream clients
Date Mon, 02 Mar 2015 17:10:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343397#comment-14343397 ]

Sean Busbey commented on HADOOP-11656:

but... "we are allowed to break things in the 3.x line" is not the same as "we should break
things in the 3.x line". I need to understand what the plan is here, especially as "little"
things like HADOOP-11293 show that what is considered private is in fact used in things like
YARN apps downstream, as are HttpServer2 and AmIPFilter.

It's easier to make progress on these kinds of API violations if we provide better boundaries.
I think this ticket is a good start to that. If downstream folks then need things moved in
to public they can request it. As things currently are there's not much incentive to make
those requests until something breaks. This way, we can front load that conversation.

Do you propose writing your own classloader? If so, we're in trouble, based on my experience
with every single classloader I have encountered. The consensus has gathered around OSGi not
because it is any better than other people's attempts, it is simply no worse, and with "a
standard", you the individual don't take the hit for: security problems, .class leakage, object
equality breakage, classloader leakage, etc. Simple example: UGI relies on being a singleton
for its identity management. Embrace classloaders and you have >1 UGI singleton, so you had
better be confident that their doAs identities work as required.

Luckily (for me, for us, for the world) I've done enough Java development to know better than
to make my own classloader again. I figured a starting point would be to make the Application
and Job classloaders from the 2.x line on by default. Depending on how my schedule lines up
with whenever 3.y happens, I'd move us over to using Apache Felix (or the Eclipse Foundation
OSGi implementation if Felix has some gap).
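
To make the singleton concern above concrete, here is a minimal sketch of how a class that
assumes it is a JVM-wide singleton ends up with two instances once a second classloader defines
it. Everything here is hypothetical and for illustration only: Identity stands in for a class
like UGI, and IsolatingLoader is a toy child-first loader, not the Application or Job
classloader from the 2.x line.

```java
import java.io.IOException;
import java.io.InputStream;

class SingletonIsolationDemo {

    /** Hypothetical stand-in for a class that relies on being a JVM-wide singleton. */
    public static class Identity {
        private static final Identity INSTANCE = new Identity();
        public static Identity get() { return INSTANCE; }
    }

    /** Toy loader that defines one named class itself instead of delegating to its parent. */
    static class IsolatingLoader extends ClassLoader {
        private final String isolatedName;

        IsolatingLoader(ClassLoader parent, String isolatedName) {
            super(parent);
            this.isolatedName = isolatedName;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(isolatedName)) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    // Re-read the same bytes the parent would use, but define them here.
                    String resource = name.replace('.', '/') + ".class";
                    try (InputStream in = getParent().getResourceAsStream(resource)) {
                        if (in == null) throw new ClassNotFoundException(name);
                        byte[] bytes = in.readAllBytes();
                        c = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e) {
                        throw new ClassNotFoundException(name, e);
                    }
                }
                if (resolve) resolveClass(c);
                return c;
            }
        }
    }

    /** Fetches the "singleton" as seen through the given loader. */
    static Object singletonVia(ClassLoader loader) throws Exception {
        Class<?> c = Class.forName(Identity.class.getName(), true, loader);
        return c.getMethod("get").invoke(null);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader app = SingletonIsolationDemo.class.getClassLoader();
        Object shared = singletonVia(app);
        Object isolated = singletonVia(new IsolatingLoader(app, Identity.class.getName()));
        System.out.println("same instance: " + (shared == isolated));
        System.out.println("same class:    " + (shared.getClass() == isolated.getClass()));
    }
}
```

Both comparisons come out false: the two loaders produce distinct Class objects from identical
bytes, so each has its own static INSTANCE. Any identity state kept in such a singleton would
have to be kept consistent across loaders, or the class would have to be loaded from one shared
parent.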

The other strategy, "ultra-lean client", is appealing, though we're fairly contaminated with
Guava, commons-logging, SLF4J, httpclient, commons-lang, etc. The notion of "single client
JAR" is going to be hard to pull off without embracing Shading, and the wrongness that comes
from that.

The problem with shading brokenness is that not doing it is worse, generally. Certainly I'd
rather just not use third party libraries, but I doubt that's going to be a practical approach.
SLF4J's API artifact is pretty stable, maybe exposing that will be fine? Anything is going
to be a tradeoff, so I'd rather get into the guts of that one in a specific task jira when there's
code to evaluate.

There's another strategy, which is pure-REST-client. Why do we need an HDFS client using Hadoop
IPC when we have webHDFS? Same for YARN? Even a YARN app shouldn't need to pull in yarn-*.jar,
though there's enough IPC there & other things you probably would have to.

If we can show that performance is "good enough" compared to relying on Hadoop IPC, it results
in an easier to maintain client artifact, and it's hidden from downstream clients, I'm fine
with taking a few stabs at whatever approaches the community thinks viable. I'd like to see
what's doable on Hadoop IPC first.
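
As a rough illustration of the pure-REST-client idea discussed above, the sketch below asks
WebHDFS for a file's status using only JDK classes (java.net.http, Java 11+), with no Hadoop
JARs on the classpath. The host, port, and path are placeholders, and 50070 is assumed as the
NameNode HTTP port; adjust for your cluster. It says nothing about the performance question,
and it ignores authentication entirely.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class WebHdfsStatusClient {

    /** Builds a WebHDFS GETFILESTATUS URL; path must start with "/". */
    static URI fileStatusUri(String host, int port, String path) throws URISyntaxException {
        return new URI("http", null, host, port, "/webhdfs/v1" + path, "op=GETFILESTATUS", null);
    }

    /** Sends the request and returns the JSON body as a string (no security handling). */
    static String fetchFileStatus(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("WebHDFS returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder cluster coordinates, not taken from this thread.
        URI uri = fileStatusUri("namenode.example.com", 50070, "/user/alice/data.txt");
        System.out.println("GET " + uri);
        // Only touch the network when explicitly asked to.
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println(fetchFileStatus(uri));
        }
    }
}
```

The point of the sketch is the dependency footprint: nothing here pulls in Guava,
commons-logging, or any other third-party library, which is what makes this option attractive
from a classpath point of view.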

bq. Updating your dependencies is a straight forward task.
Only if you can determine them at compile time. Even the change where we moved s3n:// support
out of hadoop-common and into hadoop-aws, including its transitive dependencies, was risky
enough as it meant that anything assuming the s3n:// implementation was in hadoop-common &
dependencies would break, and that breakage happens at runtime, not at compile time.
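
A small sketch of why this kind of breakage only shows up at runtime: when an implementation
is referenced by name (a string in configuration, for example) the compiler never checks that
the class exists, so the code builds cleanly whether or not the JAR is present. The class name
below is the s3n implementation as I understand it to be named in the 2.x line; treat it as an
assumption. RuntimeLookupDemo and isLoadable are hypothetical names for illustration.

```java
class RuntimeLookupDemo {

    // Referenced only as a string, so no compile-time dependency on any Hadoop artifact.
    static final String S3N_IMPL = "org.apache.hadoop.fs.s3native.NativeS3FileSystem";

    /** Returns true if the named class can be loaded from this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only check that the class can be found and linked.
            Class.forName(className, false, RuntimeLookupDemo.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The result depends entirely on what is on the classpath when this runs.
        System.out.println(S3N_IMPL + " loadable: " + isLoadable(S3N_IMPL));
    }
}
```

Run with only hadoop-common on the classpath and then again with hadoop-aws added, the same
compiled class would be expected to report different answers, which is the failure mode
described above.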

Sure, but the availability of those runtime dependencies is a framework concern, no? I don't
know the details of the s3n:// movement, but part of my goal in this ticket is that downstream
clients should have neither knowledge nor concern about which of our modules host framework

Returning to the matter in hand, you call out Guava. Is it the case that Guava is the specific
pain-point? Because we all hate being stuck on Guava 11, but have not upgraded because of
downstream apps that would get broken if we moved off it. It may be that we can work together
across the ASF projects & see how we can do a co-ordinated Guava update (hopefully
with less pain than the 2013 protobuf update) and come up with a common strategy of dealing
with Guava and other Google libraries whose backwards compatibility story isn't great.

No, I only called out Guava because it seems almost everyone has gotten cut by its sharp edges.
I think this same issue comes up eventually no matter which library you're talking about.

ps, we don't really make dependency promises. If you look at the Hadoop compatibility document,
you can see we explicitly say "no guarantees". That's not an accident. We're just being somewhat
cautious about updating things. If, say, HBase, Accumulo & Oozie all wanted a co-ordinated
update, we could try.

I'm aware that we don't formally make guarantees. In many ways, that makes things worse IMHO.
It means more time for integration testing for all downstream folks, and those building applications
in e.g. enterprise settings often don't have the option of proactively driving when we attempt
an update.

> Classpath isolation for downstream clients
> ------------------------------------------
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies
> Currently, Hadoop exposes downstream clients to a variety of third party libraries. As
our code base grows and matures we increase the set of libraries we rely on. At the same time,
as our user base grows we increase the likelihood that some downstream project will run into
a conflict while attempting to use a different version of some library we depend on. This
has already happened with e.g. Guava several times for HBase, Accumulo, and Spark (and I'm
sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to off and
they don't do anything to help dependency conflicts on the driver side or for folks talking
to HDFS directly. This should serve as an umbrella for changes needed to do things thoroughly
on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that doesn't
pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when executing user
provided code, whether client side in a launcher/driver or on the cluster in a container or
within MR.
> This provides us with a double benefit: users get less grief when they want to run substantially
ahead or behind the versions we need and the project is freer to change our own dependency
versions because they'll no longer be in our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases written in
the comments.

This message was sent by Atlassian JIRA
