Just an update on this: I have benchmarked on a cluster built with spark-ec2 and again found that reading from HDFS is not much faster than reading from S3 (about 20%).

Does anyone know how to check whether data locality is being used by Spark on my cluster?

Is it surprising that accessing HDFS on local disks is not much faster than accessing S3 across the network?






-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240  


On Fri, Aug 1, 2014 at 10:44 AM, Martin Goodson <martin@skimlinks.com> wrote:
Hi all,
I'm consistently finding that reading from HDFS is not appreciably faster than reading from S3 using PySpark. How can I tell whether data locality is being respected?

In this example, reading from HDFS is only about 10% faster than reading the same file from S3. The files were copied from S3 to HDFS using S3DistCp. (The file size is slightly smaller on HDFS, but let's ignore that for now.) This was run on an EMR cluster, but I have found the same effect using the spark-ec2 script.


from datetime import datetime

pageshdfs = sc.textFile('hdfs:///pages/year=2014/month=05/day=01/hour=0000/*')
pagess3 = sc.textFile('s3n://BUCKETNAME/pages/year=2014/month=05/day=01/hour=0000/*')

t=datetime.now(); pageshdfs.count(); datetime.now()-t
5056418
datetime.timedelta(0, 22, 123156)


t=datetime.now(); pagess3.count(); datetime.now()-t
5324499
datetime.timedelta(0, 24, 544198)
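For repeated comparisons, the timing pattern above can be wrapped in a small helper. This is just a sketch (the name `timed` is mine, not from any Spark API); on the cluster you would pass the action itself, e.g. `timed(pageshdfs.count)`:

```python
from datetime import datetime

def timed(fn):
    # Run fn() and return (result, elapsed) where elapsed is a timedelta,
    # mirroring the t=datetime.now(); ...; datetime.now()-t pattern above.
    t = datetime.now()
    result = fn()
    return result, datetime.now() - t

# Example with a plain callable; with Spark, use e.g. timed(pageshdfs.count)
res, elapsed = timed(lambda: sum(range(1000000)))
```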

(Script: s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb, AMI version 3.1.0.)


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240  