spark-user mailing list archives

From Chris Miller <cmiller11...@gmail.com>
Subject Re: newbie HDFS S3 best practices
Date Wed, 16 Mar 2016 07:59:26 GMT
If you have lots of small files, distcp should handle that well. It's
supposed to distribute the transfer of files across the nodes in your
cluster. Conductor looks more interesting if you're trying to distribute
the transfer of a single large file (or a few large files), right?
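
For the HDFS-to-S3 direction, a distcp run could look roughly like the
sketch below. It is only an illustration: the bucket name, the paths, the
s3a:// scheme, and the map count are assumptions, not something from this
thread, and it assumes the S3 connector in your Hadoop build can pick up
credentials from an IAM role or a credentials file rather than from the
URI. The helper only builds the command line (distcp's -m option caps the
number of simultaneous copy maps) and refuses destinations that embed a
key pair.

```python
import shlex
from urllib.parse import urlsplit


def build_distcp_command(src, dest, max_maps=20):
    """Return the argv list for a `hadoop distcp` copy from src to dest.

    Rough guard only: rejects a destination URI that carries user info
    (for example KEY:SECRET@bucket), so secrets stay out of the command line.
    """
    parts = urlsplit(dest)
    if parts.username is not None or parts.password is not None:
        raise ValueError("do not embed credentials in the destination URI")
    # -m limits how many map tasks copy files at the same time.
    return ["hadoop", "distcp", "-m", str(max_maps), src, dest]


# Hypothetical paths and bucket name, for illustration only.
cmd = build_distcp_command(
    "hdfs:///user/andy/results",
    "s3a://example-bucket/results",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) on a node
where hadoop is on the PATH would start the copy.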

--
Chris Miller

On Wed, Mar 16, 2016 at 4:43 AM, Andy Davidson <
Andy@santacruzintegration.com> wrote:

> Hi Frank
>
> We have thousands of small files. Each file is between 6K and maybe 100K.
>
> Conductor looks interesting
>
> Andy
>
> From: Frank Austin Nothaft <fnothaft@berkeley.edu>
> Date: Tuesday, March 15, 2016 at 11:59 AM
> To: Andrew Davidson <Andy@SantaCruzIntegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: newbie HDFS S3 best practices
>
> Hard to say with #1 without knowing your application’s characteristics;
> for #2, we use conductor <https://github.com/BD2KGenomics/conductor> with
> IAM roles, .boto/.aws/credentials files.
>
> Frank Austin Nothaft
> fnothaft@berkeley.edu
> fnothaft@eecs.berkeley.edu
> 202-340-0466
>
> On Mar 15, 2016, at 11:45 AM, Andy Davidson
> <Andy@SantaCruzIntegration.com> wrote:
>
> We use the spark-ec2 script to create AWS clusters as needed (we do not
> use AWS EMR).
>
>    1. Will we get better performance if we copy data to HDFS before we
>    run instead of reading directly from S3?
>
>    2. What is a good way to move results from HDFS to S3?
>
>
> It seems like there are many ways to bulk copy to S3. Many of them require
> that we explicitly put AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ in the
> URI. This seems like a bad idea?
>
> What would you recommend?
>
> Thanks
>
> Andy
>
>
>
>
