spark-user mailing list archives

From Ognen Duzlevski <>
Subject Re: s3n > 5GB
Date Mon, 27 Jan 2014 03:58:11 GMT
I have run Spark jobs on multiple 20GB+ files (groupByKey() on filtered
contents of these files) via s3n:// and it all worked. Well, if you
consider taking forever to read in 20GB worth of a file over a network
connection (which is the limiting factor in this scenario) as "worked".

I quickly realized that the best thing is to set up a Hadoop cluster (I
have a name node running with a bunch of data nodes on the same nodes as
the Spark cluster) using the ephemeral space on each node for speed.
Running the same jobs on the same 20GB files in this setup is many times
faster than over s3n: reading in the files takes a few seconds on a
16 node cluster.

You can pick the m1.xlarge instance for this (or any other instance that
offers lots of ephemeral disk space). It comes with roughly 1.6TB of
ephemeral disk as four volumes of about 400GB each, and you can put these
in a RAID0 stripe configuration to create one device you can put in your
HDFS pool. If you take a 10+ node cluster, this adds up to quite a lot of
local space. If a machine goes down, the ephemeral space goes with it, but
you can set the replication factor in Hadoop so you are covered. Of course
I do not rely on the ephemeral space for real persistence, but for
transient calculations it is great as a cache for jobs that you would
otherwise run on S3 or EBS.
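
To put rough numbers on "quite a lot of local space", here is a
back-of-the-envelope sketch in Python. The function names are made up for
illustration, the disk sizes are the approximate figures quoted above, and
it assumes every block is stored `replication` times (the `dfs.replication`
setting in Hadoop, default 3). It ignores reserved and non-DFS space, so
treat the result as an upper bound rather than a measurement.

```python
def raw_capacity_gb(nodes, disks_per_node=4, disk_gb=400):
    """Total ephemeral disk across the cluster, before replication."""
    if nodes < 1 or disks_per_node < 1 or disk_gb <= 0:
        raise ValueError("nodes, disks_per_node and disk_gb must be positive")
    return nodes * disks_per_node * disk_gb


def usable_capacity_gb(nodes, replication=3, disks_per_node=4, disk_gb=400):
    """Approximate space left for distinct data when each block is
    stored `replication` times (assumption: no reserved/non-DFS overhead)."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return raw_capacity_gb(nodes, disks_per_node, disk_gb) / replication


if __name__ == "__main__":
    for factor in (1, 2, 3):
        print(factor, round(usable_capacity_gb(16, replication=factor)))
```

For the 16 node cluster mentioned earlier, that is about 25.6TB raw, or
about 8.5TB of distinct data at a replication factor of 3.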

HDFS is one of the few truly free distributed parallel filesystems out
there. I did not have the time to spend 3 months learning how Lustre works
;) or the money to pay IBM for GPFS, so the only thing really left is HDFS.


On Sun, Jan 26, 2014 at 8:18 PM, kamatsuoka <> wrote:

> The Hadoop docs about s3 (linked to by the Spark docs) say that s3n files
> are subject to "the 5GB limit on file size imposed by S3." However, that
> limit was raised about three years ago. So it wasn't clear to me whether
> this limit still applies to Hadoop's s3n URLs.
> Well, I tried running a Spark job on a 200GB s3n file, and it ran fine.
> Has this been other people's experience?