spark-user mailing list archives

From Raajen Patel <raa...@gmail.com>
Subject Re: Spark driver assigning splits to incorrect workers
Date Mon, 04 Jul 2016 15:07:42 GMT
Hi Ted,

Thanks for your response. Perhaps this will help: I am trying to
read binary files stored across a series of servers.

Line used to build the RDD:
val BIN_pairRDD: RDD[(BIN_Key, BIN_Value)] =
  spark.newAPIHadoopFile("not.used", classOf[BIN_InputFormat],
    classOf[BIN_Key], classOf[BIN_Value], config)

In order to support this, we have the following custom classes:
- BIN_Key and BIN_Value as the paired entry for the RDD
- BIN_RecordReader and BIN_FileSplit to handle the special splits
- BIN_FileSplit overrides getLocations() and getLocationInfo(), and we have
verified that the right IP address is being sent to Spark.
- BIN_InputFormat queries a database for the details of every split to be
created, namely which file to read and the IP address of the machine where
that file is local.
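For context, the locality override in our BIN_FileSplit looks roughly like
this (a simplified sketch; the real class also carries the database-derived
metadata and the Writable serialization of its fields, and "hostIp" is just
an illustrative parameter name):

```scala
import org.apache.hadoop.mapred.SplitLocationInfo
import org.apache.hadoop.mapreduce.lib.input.FileSplit

// Sketch of the custom split: hostIp is the address (looked up in the
// database) of the machine where this split's file is stored locally.
class BIN_FileSplit(val hostIp: String) extends FileSplit {
  // Tell the scheduler which host holds the data for this split.
  override def getLocations(): Array[String] = Array(hostIp)

  // Report the same host, flagged as on-disk rather than in-memory.
  override def getLocationInfo(): Array[SplitLocationInfo] =
    Array(new SplitLocationInfo(hostIp, false))
}
```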

When it works:
- No problems running a local job
- No problems running in a cluster with one computer as the master and
another computer running 3 workers, which also holds the files to process.

When it fails:
- When running in a cluster with multiple workers and the files spread
across multiple computers: tasks are not assigned to the nodes where the
files are local.

Thanks,
Raajen
