spark-user mailing list archives

From Alex Gittens <swift...@gmail.com>
Subject Re: Need clarification on spark on cluster set up instruction
Date Wed, 01 Jul 2015 17:38:47 GMT
I have a similar use case, so I wrote a Python script to fix the cluster
configuration that spark-ec2 generates when you use Hadoop 2. Start a cluster
with enough machines that HDFS can hold 1 TB (so use instance
types that have SSDs), then follow the instructions at
http://thousandfold.net/cz/2015/07/01/installing-spark-with-hadoop-2-using-spark-ec2/.
Let me know if you run into any issues.
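
As a rough illustration of "enough machines": a minimal sizing sketch (not from the thread), assuming HDFS's default replication factor of 3 and some headroom left free on each disk for shuffle and temp files. The per-node SSD capacities and the instance types listed are assumptions based on AWS specs, not anything the poster specified.

```python
import math

# Hypothetical sizing helper: how many spark-ec2 slaves are needed so that
# HDFS can hold a dataset, given a replication factor (HDFS default is 3)?
# The per-node instance-store SSD capacities below are assumptions.
INSTANCE_STORE_GB = {
    "r3.2xlarge": 160,   # 1 x 160 GB SSD (assumed)
    "i2.xlarge": 800,    # 1 x 800 GB SSD (assumed)
}

def slaves_needed(dataset_gb, instance_type, replication=3, headroom=0.75):
    """Nodes required so raw HDFS capacity covers dataset_gb * replication.

    headroom is the fraction of each disk treated as usable, leaving the
    rest free for shuffle spill and temporary files.
    """
    usable_per_node = INSTANCE_STORE_GB[instance_type] * headroom
    return math.ceil(dataset_gb * replication / usable_per_node)

print(slaves_needed(1024, "r3.2xlarge"))  # 1 TB dataset -> 26 nodes
print(slaves_needed(1024, "i2.xlarge"))   # 1 TB dataset -> 6 nodes
```

The point of the headroom parameter is that a cluster sized to exactly 3x the dataset will fail jobs once shuffle output starts competing with HDFS blocks for disk.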

On Mon, Jun 29, 2015 at 4:32 PM, manish ranjan <cse1.manish@gmail.com>
wrote:

>
> Hi All
>
> here goes my first question:
> Here is my use case
>
> I have 1 TB of data that I want to process on EC2 using Spark.
> I have uploaded the data to an EBS volume.
> The instruction on amazon ec2 set up explains
> "*If your application needs to access large datasets, the fastest way to
> do that is to load them from Amazon S3 or an Amazon EBS device into an
> instance of the Hadoop Distributed File System (HDFS) on your nodes*"
>
> Now the new Amazon instances don't have any physical (instance-store) volumes:
> http://aws.amazon.com/ec2/instance-types/
>
> So do I need to set up HDFS separately on EC2 (the instructions also
> say "The spark-ec2 script already sets up an HDFS instance for you")? Any
> blog/write-up which can help me understand this better?
>
> ~Manish
