spark-user mailing list archives

From Ryan Weald <>
Subject Re: s3n > 5GB
Date Mon, 27 Jan 2014 17:24:44 GMT
I have run Hadoop + Spark jobs on large s3n files without an issue. That
said, if you have very large files you might want to consider using
s3:// instead, as it uses an HDFS-compatible block storage format, which
means you can split a large file across map tasks more effectively.
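
For example, here is a minimal sketch of pointing Spark at the two
filesystems (the master URL, bucket name, credentials, and paths are all
placeholders):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "s3-example")

    // Native S3 filesystem: objects are plain files visible in the AWS console.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    val nativeRdd = sc.textFile("s3n://my-bucket/big-file.log")

    // Block filesystem: data is stored in HDFS-style blocks, so a large file
    // splits more evenly across map tasks.
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY")
    val blockRdd = sc.textFile("s3://my-bucket/big-file.log")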

In my experience I also had reliability issues, with jobs failing due to
read problems when using s3n with large files. These issues went away when
I switched to s3://. The downside, of course, is that you can no longer view
files written with s3:// in the AWS console, which means you need an
HDFS-compatible viewing tool such as the hdfs command-line utility.
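
If you would rather inspect them programmatically, the Hadoop FileSystem
API works too; a rough sketch (bucket, credentials, and paths are
placeholders):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY")
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY")

    // List what the block filesystem wrote; these entries do not show up
    // as ordinary objects in the AWS console.
    val fs = FileSystem.get(new URI("s3://my-bucket"), conf)
    fs.listStatus(new Path("s3://my-bucket/output")).foreach(s => println(s.getPath))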


On Sun, Jan 26, 2014 at 7:58 PM, Ognen Duzlevski wrote:

> I have run Spark jobs on multiple 20GB+ files (groupByKey() on filtered
> contents of these files) via s3n:// and it all worked. Well, if you
> consider taking forever to read in 20GB worth of a file over a network
> connection (which is the limiting factor in this scenario) as "worked".
> I quickly realized that the best thing is to set up a Hadoop cluster (I
> have a name node running with a bunch of data nodes on the same nodes as
> the Spark cluster) using the ephemeral space on each node for speed.
> Running the same jobs on the same 20GB files in this setup is many times
> faster than over s3n; I am talking a few seconds to read in the files on a
> 16-node cluster.
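
For reference, here is a rough sketch of the shape of that kind of job
against local HDFS, assuming the same SparkContext as above; the namenode
address and the tab-separated parsing are placeholders:

    import org.apache.spark.SparkContext._   // pair-RDD implicits for groupByKey

    val lines = sc.textFile("hdfs://namenode:9000/data/big-file.log")
    val grouped = lines
      .filter(_.contains('\t'))              // drop lines we cannot split
      .map { line =>
        val fields = line.split('\t')
        (fields(0), fields(1))               // (key, value) from the first two fields
      }
      .groupByKey()
    grouped.saveAsTextFile("hdfs://namenode:9000/out/grouped")
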
> You can pick the m1.xlarge instance for this (or any other instance that
> offers lots of ephemeral disk space); it comes with 1.6TB of ephemeral
> disk in 4x400GB partitions, which you can put in a RAID0 stripe
> configuration to create one device for your HDFS pool. With a 10+ node
> cluster, this adds up to quite a lot of local space. If a machine goes
> down, the ephemeral space goes with it, but you can set the replication
> factor in Hadoop so you are covered. Of course I do not rely on the
> ephemeral space for real persistence, but for transient calculations it
> is great as a cache for jobs that you would otherwise run on S3 or EBS.
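
One aside on the replication factor: since it is just a Hadoop
configuration key, you can also set it per job from the Spark side; a
minimal sketch, where the value of 2 is only an example:

    // Files this job writes to HDFS (e.g. via saveAsTextFile) then carry
    // two replicas, so losing one node's ephemeral disks does not lose data.
    sc.hadoopConfiguration.set("dfs.replication", "2")
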
> HDFS is one of the few truly free distributed parallel filesystems out
> there. I did not have the time to spend 3 months learning how Lustre works
> ;) or the money to pay IBM for GPFS, so the only thing really left is HDFS.
> Ognen
> On Sun, Jan 26, 2014 at 8:18 PM, kamatsuoka <> wrote:
>> The Hadoop docs on S3 (linked to by the Spark docs) say that s3n files are
>> subject to "the 5GB limit on file size imposed by S3." However, that limit
>> was raised about three years ago, so it wasn't clear to me whether it still
>> applies to Hadoop's s3n URLs.
>> Well, I tried running a Spark job on a 200GB s3n file, and it ran fine. Has
>> this been other people's experience?
