I have run Spark jobs on multiple 20GB+ files (groupByKey() on filtered contents of these files) via s3n:// and it all worked. Well, if you consider taking forever to read in 20GB worth of files over a network connection (the network being the limiting factor in this scenario) as "worked".
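For what it's worth, the jobs were nothing exotic. A minimal sketch of that kind of job (the bucket, path and filter predicate below are made up) looks roughly like this in the Scala shell:

  // in spark-shell, where sc is the SparkContext it provides
  val lines = sc.textFile("s3n://some-bucket/logs/part-*")      // hypothetical bucket/path
  val pairs = lines.filter(_.contains("ERROR"))                 // whatever filter you need
                   .map(line => (line.split("\t")(0), line))    // key on the first field
  val grouped = pairs.groupByKey()                              // the expensive shuffle
  grouped.count()                                               // force evaluation

The read from S3 is what dominates the run time here, not the shuffle itself.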

I quickly realized that the best thing is to set up a Hadoop cluster (I have a name node running, with data nodes on the same machines as the Spark workers) and use the ephemeral space on each node for speed. Running the same jobs on the same 20GB files in this setup is many times faster than over s3n; I am talking a few seconds to read in the files on a 16-node cluster.
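The only change on the Spark side is the input URL; something like this (namenode host and path are placeholders) reads straight out of the local HDFS pool instead of S3:

  // same job as above, just pointed at HDFS instead of s3n
  val lines = sc.textFile("hdfs://namenode-host:8020/data/logs/part-*")  // placeholder host/path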

You can pick the m1.xlarge instance for this (or any other instance type that offers lots of ephemeral disk space); it comes with 1.6TB of ephemeral storage as 4x400GB disks, which you can put in a RAID0 stripe to create one device for your HDFS pool. With a 10+ node cluster, this adds up to quite a lot of local space. If a machine goes down, its ephemeral space goes with it, but you can set the replication factor in HDFS so you are covered. Of course I do not rely on the ephemeral space for real persistence, but for transient calculations it is great as a cache for jobs that you would otherwise run against S3 or EBS.
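To make the replication point concrete: you can set dfs.replication in hdfs-site.xml on the cluster, or, as a rough sketch, override it per job through the Hadoop configuration Spark carries, since it is a client-side setting that should apply to whatever that job writes (output path below is a placeholder, reusing the grouped RDD from the sketch above):

  // ask HDFS to keep 3 copies of whatever this job writes, so losing
  // one node's ephemeral disks does not lose the output
  sc.hadoopConfiguration.set("dfs.replication", "3")
  grouped.mapValues(_.size).saveAsTextFile("hdfs://namenode-host:8020/output/counts")  // placeholder path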

HDFS is one of the rare truly free distributed parallel filesystems out there. I did not have the time to spend 3 months learning how Lustre works ;) or the money to pay IBM for GPFS, so HDFS was really the only thing left.

Ognen


On Sun, Jan 26, 2014 at 8:18 PM, kamatsuoka <kenjim@gmail.com> wrote:
The Hadoop docs about S3 <http://wiki.apache.org/hadoop/AmazonS3> (linked
to by the Spark docs) say that s3n files are subject to "the 5GB limit on
file size imposed by S3." However, that limit was raised
<http://www.computerworld.com/s/article/9200763/Amazon_s_S3_can_now_store_files_of_up_to_5TB>
about three years ago, so it wasn't clear to me whether this limit still
applies to Hadoop's s3n URLs.

Well, I tried running a Spark job on a 200GB s3n file, and it ran fine. Has
this been other people's experience?


