spark-user mailing list archives

From Martin Goodson <mar...@skimlinks.com>
Subject Reading from HDFS no faster than reading from S3 - how to tell if data locality respected?
Date Fri, 01 Aug 2014 09:44:00 GMT
Hi all,
I'm consistently finding that reading from HDFS is not appreciably faster
than reading from S3 using pyspark. How can I tell whether data locality is
being respected?

In this example, reading from HDFS is only about 10% faster than reading
the same file from S3. The files were pulled from S3 using S3DistCp. (The
file size is slightly smaller on HDFS, but let's ignore that for now.) This
was run on an EMR cluster, but I have found the same effect using the
spark-ec2 script.


from datetime import datetime

pageshdfs=sc.textFile('hdfs:///pages/year=2014/month=05/day=01/hour=0000/*')
pagess3=sc.textFile('s3n://BUCKETNAME/pages/year=2014/month=05/day=01/hour=0000/*')

t=datetime.now(); pageshdfs.count(); datetime.now()-t
5056418
datetime.timedelta(0, 22, 123156)


t=datetime.now(); pagess3.count(); datetime.now()-t
5324499
datetime.timedelta(0, 24, 544198)
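(For reference, here is the same timedelta arithmetic done explicitly on the two elapsed times above, confirming the roughly 10% gap I mentioned:)

```python
from datetime import timedelta

# Elapsed times reported above, reconstructed as timedeltas
hdfs_s = timedelta(0, 22, 123156).total_seconds()  # 22.123156 s
s3_s = timedelta(0, 24, 544198).total_seconds()    # 24.544198 s

# The S3 read takes about 11% longer than the HDFS read
ratio = s3_s / hdfs_s
print("S3/HDFS elapsed ratio: %.3f" % ratio)
```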

(Script: s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb
and ami-version 3.1.0).


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240
