spark-user mailing list archives

From Raajen <raa...@gmail.com>
Subject Spark driver assigning splits to incorrect workers
Date Fri, 01 Jul 2016 21:46:22 GMT
I would like to use Spark on a non-distributed file system, but I am having
trouble getting the driver to assign tasks to the workers that are local to
the files. I have extended InputSplit to create my own version of FileSplit,
so that each worker gets a bit more information than the default FileSplit
provides. I expected the driver to assign splits based on their locality,
but instead it sends them to workers seemingly at random -- even the very
first split can go to a node with a different IP than the one the split
specifies. I can see that I am providing the right node address via
getLocations(). I have also set spark.locality.wait to a high value, but
the same misassignment keeps happening.
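For reference, here is a simplified sketch of my split class (the class
name and the extra field are illustrative, not my exact code; the real
class carries more metadata):

  import java.io.{DataInput, DataOutput}
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.mapreduce.lib.input.FileSplit

  // Carries one extra piece of metadata on top of the standard FileSplit.
  class LocatedFileSplit(file: Path, start: Long, length: Long,
                         hosts: Array[String], var extra: String)
      extends FileSplit(file, start, length, hosts) {

    // Hadoop needs a no-arg constructor to deserialize the split.
    def this() = this(null, 0L, 0L, Array.empty[String], "")

    // getLocations() is inherited from FileSplit and returns the hosts
    // array above -- this is where the node addresses are handed to the
    // scheduler.

    override def write(out: DataOutput): Unit = {
      super.write(out)
      out.writeUTF(extra)
    }

    override def readFields(in: DataInput): Unit = {
      super.readFields(in)
      extra = in.readUTF()
    }
  }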

I am using newAPIHadoopFile to create my RDD. My InputFormat creates the
required splits, though not all splits refer to the same file or the same
worker IP.
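The RDD is built roughly like this (the path, the key/value types, and the
MyInputFormat class name are placeholders for my actual ones):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("locality-test")
    // Give the scheduler plenty of time to wait for a node-local slot.
    .set("spark.locality.wait", "30s")
  val sc = new SparkContext(conf)

  // MyInputFormat is my custom InputFormat; it emits the LocatedFileSplit
  // splits shown above.
  val rdd = sc.newAPIHadoopFile(
    "file:///data/input",
    classOf[MyInputFormat],
    classOf[LongWritable],
    classOf[Text],
    sc.hadoopConfiguration)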

What else can I check, or change, to force the driver to send these tasks
to the right workers?

Thanks!




