spark-dev mailing list archives

From Sean Owen
Subject Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?
Date Fri, 12 Jun 2015 16:42:24 GMT

On Fri, Jun 12, 2015 at 5:12 PM, Patrick Wendell wrote:
> I would like to understand though Sean - what is the proposal exactly?
> Hadoop 2 itself supports all of the Hadoop 1 API's, so things like
> removing the Hadoop 1 variant of sc.hadoopFile, etc, I don't think

Not entirely; you can see some binary incompatibilities that have
bitten recently. A Hadoop 1 program does not in general work on Hadoop
2 because of this.

Part of my thinking is that I'm not sure Hadoop 1.x and 2.0.x fully
work anymore anyway. See for example SPARK-8057 recently. I recall
similar problems with Hadoop 2.0.x-era releases and the Spark build
for them, which is basically the 'cdh4' build.

So one benefit is skipping whatever work would be needed to keep
fixing this up, and the argument is that there may be less loss of
functionality than it seems. The other is being able to use later
APIs. That one is fairly minor.

> The main reason I'd push back is that I do think there are still
> people running the older versions. For instance at Databricks we use
> the FileSystem library for talking to S3... every time we've tried to
> upgrade to Hadoop 2.X there have been significant regressions in
> performance and we've had to downgrade. That's purely anecdotal, but I
> think you have people out there using the Hadoop 1 bindings for whom
> upgrade would be a pain.

Yeah, that's the question. Is anyone out there using 1.x? More
anecdotes wanted. That might be the most interesting question.

No CDH customers would have been on 1.x for a long while now, for
example. (There are still a small number of CDH 4 customers out there,
and that's 2.0.x or so, but that's a gray area.)

Is the S3 library thing really related to Hadoop 1.x? That comes from
jets3t, which is independent.

> In terms of our maintenance cost, to me the much bigger cost for us
> IMO is dealing with differences between e.g. 2.2, 2.4, and 2.6 where
> major new API's were added. In comparison the Hadoop 1 vs 2 seems

Really? I'd say the opposite. No APIs that are only in 2.2, let alone
only in a later version, can be in use now, right? 1.x wouldn't work
at all then. And among the 2.x versions, I don't know of any binary
incompatibilities of the type that exist between 1.x and 2.x, which we
have had to shim to make work.

In both cases dependencies have to be harmonized here and there, yes.
That won't change.
