spark-user mailing list archives

From Alex Gittens <>
Subject Re: Need clarification on spark on cluster set up instruction
Date Wed, 01 Jul 2015 17:38:47 GMT
I have a similar use case, so I wrote a Python script to fix the cluster
configuration that spark-ec2 generates when you use Hadoop 2. Start a cluster
with enough machines that HDFS can hold 1 TB (so use instance
types that have SSDs), then follow the instructions at
Let me know if you have any issues.
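To make the "enough machines with SSDs" step concrete, a launch along these lines is one possible sketch. The key pair, region, instance type, and cluster size are placeholders, and the flags reflect the spark-ec2 options as commonly documented, so double-check them against `spark-ec2 --help` for your Spark version:

```shell
# Sketch (placeholders throughout): launch a spark-ec2 cluster on
# SSD-backed instances so the ephemeral HDFS has room for ~1 TB of data
# (remember HDFS replication multiplies the raw space needed).
# --hadoop-major-version=2 selects the Hadoop 2 layout discussed above.
./spark-ec2 \
  --key-pair=my-keypair \
  --identity-file=my-keypair.pem \
  --region=us-east-1 \
  --instance-type=r3.2xlarge \
  --slaves=8 \
  --hadoop-major-version=2 \
  launch my-cluster
```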

On Mon, Jun 29, 2015 at 4:32 PM, manish ranjan <> wrote:

> Hi All,
> Here goes my first question. Here is my use case:
> I have 1 TB of data I want to process on EC2 using Spark.
> I have uploaded the data to an EBS volume.
> The Amazon EC2 setup instructions explain:
> "*If your application needs to access large datasets, the fastest way to
> do that is to load them from Amazon S3 or an Amazon EBS device into an
> instance of the Hadoop Distributed File System (HDFS) on your nodes*"
> Now the new Amazon instances don't have any physical volume,
> so do I need to set up HDFS separately on EC2 (the instructions also
> say "The spark-ec2 script already sets up a HDFS instance for you")? Any
> blog/write-up which can help me understand this better?
> ~Manish
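For the loading step the instructions describe, one possible sketch is to pull the dataset from S3 straight into the cluster's HDFS with `distcp`, then check capacity. The bucket name and paths are placeholders, the relative `ephemeral-hdfs/bin` path assumes the spark-ec2 master's default layout, and on older Hadoop builds the S3 scheme may be `s3n://` rather than `s3a://`:

```shell
# Sketch: copy a dataset from S3 into the cluster's HDFS in parallel.
# Bucket and paths are placeholders; run this from the master node.
ephemeral-hdfs/bin/hadoop distcp \
  s3n://my-bucket/my-1tb-dataset \
  hdfs:///data/my-1tb-dataset

# Verify cluster capacity and the copied files:
ephemeral-hdfs/bin/hadoop dfsadmin -report
ephemeral-hdfs/bin/hadoop fs -du -h /data
```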
