Here what I would suggest. In order to protect from Human error, start a ec2 instance with the ec2 script. Copy over the folders as they are well integrated with hdfs, compatible drivers and versions. Change configuration to configure slaves and masters. 
Sorry if its offensive :) .. I found dealing with various hadoop incompatibilities very taxing. 

Mayur Rustagi
Ph: +919632149971

On Sun, Jan 19, 2014 at 8:23 PM, Ognen Duzlevski <> wrote:
On Sun, Jan 19, 2014 at 2:49 PM, Ognen Duzlevski <> wrote:

My basic requirement is to set everything up myself and understand it. For testing purposes my cluster has 15 xlarge instances and I guess I will just set up a hadoop cluster to run over these instances for the purposes of getting the benefits of HDFS. I would then set up hdfs over S3 with blocks.

By this I mean I would set up a Hadoop cluster running in parallel on the same instances just for the purposes of running Spark over HDFS. Is this a reasonable approach? What kind of a performance penalty (memory, CPU cycles) am I going to incur by the Hadoop daemons running just for this purpose?