spark-dev mailing list archives

From Patrick Wendell <>
Subject Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?
Date Sun, 14 Jun 2015 00:42:36 GMT
Yeah so Steve, hopefully it's self-evident, but that is a perfect
example of the kind of annoying stuff we don't want to force users to
deal with by forcing an upgrade to 2.X. Consider the pain for Spark
users of trying to reason about what to do (and btw it seems like the
answer is simply that there isn't a good answer). That pain will be
experienced by every Spark user who uses AWS and the Spark ec2
scripts, which are extremely popular.

Is this pain, in aggregate, more than our cost of having a few patches
to deal with runtime reflection stuff to make things work with Hadoop
1? My feeling is that it's much more efficient for us as the Spark
maintainers to pay this cost rather than to force a lot of our users
to deal with painful upgrades.
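
To make the "runtime reflection stuff" concrete, here is a minimal sketch in
Python of the general pattern: probe at runtime for a method that only a newer
library version has, and fall back to the older call when it is missing.
OldClient, NewClient, and both method names are hypothetical stand-ins invented
for this illustration; they are not Hadoop or Spark APIs.

```python
class OldClient:
    """Hypothetical stand-in for an older library version."""

    def get_bytes_read(self):
        return 10


class NewClient:
    """Hypothetical stand-in for a newer version with an extra method."""

    def get_bytes_read(self):
        return 10

    def get_thread_bytes_read(self):
        return 4


def bytes_read(client):
    # Look the newer method up by name at runtime instead of calling it
    # directly, so the same code works against either version.
    method = getattr(client, "get_thread_bytes_read", None)
    if callable(method):
        return method()
    # Fall back to the call that every version provides.
    return client.get_bytes_read()
```

Each API difference needs one small shim of this shape, which is the
maintainer-side cost being weighed against forced user upgrades.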

On Sat, Jun 13, 2015 at 1:39 AM, Steve Loughran <> wrote:
>> On 12 Jun 2015, at 17:12, Patrick Wendell <> wrote:
>>  For instance at Databricks we use
>> the FileSystem library for talking to S3... every time we've tried to
>> upgrade to Hadoop 2.X there have been significant regressions in
>> performance and we've had to downgrade. That's purely anecdotal, but I
>> think you have people out there using the Hadoop 1 bindings for whom
>> upgrade would be a pain.
> Ah, s3n. The unloved orphan FS, which has been fairly neglected as being non-strategic
> to anyone but Amazon, who have a private fork.
> s3n broke in Hadoop 2.4, where the upgraded JetS3t went in with a patch which swallowed
> exceptions (nobody should ever do that) and as a result would NPE on a seek(0) of a file of
> length 0 (HADOOP-10457). Fixed in Hadoop 2.5.
> Hadoop 2.6 has left s3n on maintenance out of fear of breaking more things; future work
> is in s3a, which switched to the Amazon awstoolkit JAR and moved the implementation to the
> hadoop-aws JAR. s3a promises: speed, partitioned upload, better auth.
> But: it's not ready for serious use in Hadoop 2.6, so don't try. You need the Hadoop
> 2.7 patches, which are in ASF Hadoop 2.7, will be in HDP2.3, and have been picked up in
> CDH5.3 (HADOOP-11571). For Spark, the fact that the block size is being returned as 0 in
> getFileStatus() could be the killer.
> Future work is going to improve performance and scale (HADOOP-11694).
> Now, if Spark is finding problems with s3a performance, tests for this would be great,
> and complaints on JIRAs too. There's not enough functional testing of analytics workloads
> against the object stores, especially s3 and swift. If someone volunteers to add some
> optional test module for object store testing, I'll help review it and suggest some tests
> to generate stress.
> That can be done without the leap to Hadoop 2, though the proposed HADOOP-9565 work
> allowing object stores to declare that they are object stores, and to publish some of their
> consistency and atomicity semantics, will be Hadoop 2.8+. If you want your output committers
> to recognise when the destination is an eventually consistent object store with O(n)
> directory rename and delete, that's where the code will be.
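
One plausible reading of the block-size concern quoted above, sketched in
Python. This assumes split planning follows the rule used by Hadoop's old-API
FileInputFormat, splitSize = max(minSize, min(goalSize, blockSize)), with a
minimum split size of 1 byte. Under that assumption a reported block size of 0
collapses the split size to a single byte. The function names are illustrative,
and the slop factor the real implementation applies to the last split is
omitted.

```python
def compute_split_size(goal_size, min_size, block_size):
    # Assumed rule: never below min_size, never above the smaller of
    # the goal size and the block size reported by the file system.
    return max(min_size, min(goal_size, block_size))


def count_splits(file_length, split_size):
    # Integer ceiling division: number of splits needed to cover the file.
    return -(-file_length // split_size)


MIB = 1024 * 1024
file_length = 64 * MIB
goal_size = file_length // 2  # caller asked for roughly two splits
min_size = 1                  # assumed default minimum split size

# File system reporting a sane 32 MiB block size.
healthy = count_splits(
    file_length, compute_split_size(goal_size, min_size, 32 * MIB))

# File system reporting a block size of 0, as described for s3a.
degenerate = count_splits(
    file_length, compute_split_size(goal_size, min_size, 0))
```

With these inputs the healthy case plans 2 splits, while the zero block size
plans one split per byte of the file.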
