spark-user mailing list archives

From Sunil <>
Subject Data locality with HDFS not being seen
Date Thu, 20 Aug 2015 11:09:16 GMT
Hello,

I am seeing unexpected behaviour with HDFS data locality. I expect tasks to
be executed only on a node that holds the data, but this is not happening
(unless, of course, that node is busy, in which case I understand tasks can
go to some other node). Could anyone clarify what is wrong with my approach,
or what I should do instead? Below are the cluster configuration and the
experiments I have tried. Any help will be appreciated. If you would like to
recreate the scenario below, you can use the example that ships with Spark.

*Cluster configuration:*

1. spark-1.4.0 and hadoop-2.7.1
2. Machines --> one master node (master) and 6 worker nodes (node1 to node6)
3. master acts as --> Spark master, HDFS NameNode & Secondary NameNode, YARN
ResourceManager
4. Each of the 6 worker nodes acts as --> Spark worker, HDFS DataNode, YARN
NodeManager

*Data on HDFS:*

A 20 MB text file is stored in a single block. With a replication factor of
3, the text file is stored on nodes 2, 3 and 4.

*Test-1 (Spark stand alone mode):*

The application is the standard Java word count example, with the above text
file in HDFS as input. On job submission, the Spark web UI shows stage 0
(i.e. mapToPair) running on seemingly random nodes (node1, node2, node6,
etc.). By random I mean that stage 0 executes on the very first worker node
that registers with the application (this can be seen in the event timeline
graph). Instead, I expect stage 0 to run only on one of the three nodes 2, 3
or 4.

*Test-2 (YARN cluster mode):*

Same as above. No data locality seen.

*Additional info:*

No other Spark applications are running, and I have also tried setting
spark.locality.wait to 10s, but it made no difference.
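
To make the expectation above concrete, here is a small model of the
behaviour I am expecting from the locality wait. It is a simplified,
hypothetical sketch in plain Java, not Spark's scheduler code: the names
LocalitySketch, Offer and pickHost are invented for illustration, and it
assumes that offers arrive in time order and that hosts are matched by exact
name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of delay scheduling; NOT Spark's actual code.
final class LocalitySketch {

    // One resource offer: a host with a free slot, and the time (in ms since
    // the task became runnable) at which the offer is made.
    static final class Offer {
        final String host;
        final long timeMs;

        Offer(String host, long timeMs) {
            this.host = host;
            this.timeMs = timeMs;
        }
    }

    // Returns the host of the first acceptable offer, or null if there is none.
    // An offer is acceptable if it comes from a host holding a replica, or if
    // the locality wait has already expired when the offer is made.
    static String pickHost(Set<String> preferredHosts, List<Offer> offers,
                           long localityWaitMs) {
        for (Offer offer : offers) {
            boolean local = preferredHosts.contains(offer.host);
            boolean waitExpired = offer.timeMs >= localityWaitMs;
            if (local || waitExpired) {
                return offer.host;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Replica hosts taken from the setup described in this message.
        Set<String> replicas =
                new HashSet<>(Arrays.asList("node2", "node3", "node4"));
        // Illustrative offer times; node1 registers first, as in Test-1.
        List<Offer> offers = Arrays.asList(
                new Offer("node1", 0L),
                new Offer("node6", 50L),
                new Offer("node3", 200L));

        // With a 10s wait, the early non-local offers are declined.
        System.out.println(pickHost(replicas, offers, 10_000L)); // prints node3
        // With no wait, the first offer wins regardless of locality.
        System.out.println(pickHost(replicas, offers, 0L));      // prints node1
    }
}
```

In terms of this model, what I observe in both tests looks like the second
call (the first registered node wins), whereas with spark.locality.wait set
to 10s I would expect something like the first call.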

Thanks and regards,
