spark-user mailing list archives

From Martin Goodson <mar...@skimlinks.com>
Subject Re: Reading from HDFS no faster than reading from S3 - how to tell if data locality respected?
Date Mon, 04 Aug 2014 15:36:09 GMT
Just an update on this: I have benchmarked on a cluster built with
spark-ec2 and again found that reading from HDFS is not much faster than
reading from S3 (about 20%).

Does anyone know how to check whether Spark is actually using data locality
on my cluster?

Is it surprising that reading from HDFS on local disks is not much faster
than reading from S3 across the network?
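
A single timed run per source can be skewed by warm-up and caching effects, so
one option is to repeat each action a few times and compare the minimum and
median. The following is only a minimal sketch for Python 3: `time_action` and
`summarize` are hypothetical helper names, not part of Spark, and the commented
usage assumes the `pageshdfs` and `pagess3` RDDs from the original message.

```python
import time
from statistics import median


def time_action(action, repeats=3):
    """Call a zero-argument callable `repeats` times.

    Returns the result of the last call and a list of wall-clock
    durations in seconds, one per call.
    """
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        durations.append(time.perf_counter() - start)
    return result, durations


def summarize(durations):
    """Reduce a list of durations to min, median and max."""
    return {
        "min": min(durations),
        "median": median(durations),
        "max": max(durations),
    }


# Hypothetical usage in a pyspark shell (assumes the RDDs already exist):
# _, hdfs_secs = time_action(pageshdfs.count)
# _, s3_secs = time_action(pagess3.count)
# print(summarize(hdfs_secs), summarize(s3_secs))
```

If the gap between HDFS and S3 stays small across repeated runs, the difference
is less likely to be measurement noise.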






-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Fri, Aug 1, 2014 at 10:44 AM, Martin Goodson <martin@skimlinks.com>
wrote:

> Hi all,
> I'm consistently finding that reading from HDFS is not appreciably faster
> than reading from S3 using pyspark. How can I tell whether data locality is
> being respected?
>
> In this example, reading from HDFS is only about 10% faster than reading
> the same file from S3. The files were pulled from S3 using S3DistCp. (The
> file size is slightly smaller on HDFS, but let's ignore that for now.) This
> was run on an EMR cluster, but I have found the same effect using the
> spark-ec2 script.
>
>
>
> from datetime import datetime
>
> pageshdfs = sc.textFile('hdfs:///pages/year=2014/month=05/day=01/hour=0000/*')
>
> pagess3 = sc.textFile('s3n://BUCKETNAME/pages/year=2014/month=05/day=01/hour=0000/*')
>
> # Time a full count over the HDFS copy
> t = datetime.now(); pageshdfs.count(); datetime.now() - t
> 5056418
> datetime.timedelta(0, 22, 123156)
>
>
> # Time a full count over the S3 copy
> t = datetime.now(); pagess3.count(); datetime.now() - t
> 5324499
> datetime.timedelta(0, 24, 544198)
>
> (Script: s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb
>   and ami-version 3.1.0).
>
>
> --
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240
>
