It would be much appreciated if you could give more details about why the runJob function cannot be called inside getPreferredLocations().

The NewHadoopRDD and HadoopRDD classes both get the location information from the InputSplit. But there may be an issue in NewHadoopRDD: it generates all of the InputSplits on the master node, which means only a single node is used to generate and filter the InputSplits even when the number of InputSplits is huge. Will that be a performance bottleneck?
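To make the concern concrete, here is a minimal plain-Scala sketch of that pattern (FakeSplit, listSplits, and planPartitions are hypothetical stand-ins, not the real Hadoop/Spark API): every split is materialized and filtered in the single driver JVM before any executor is involved:

```scala
// Hypothetical stand-in for Hadoop's InputSplit (illustrative, not the real API).
final case class FakeSplit(path: String, hosts: Seq[String])

object DriverSideSplits {
  // All splits are materialized on the driver (master) node, in one JVM...
  def listSplits(paths: Seq[String]): Seq[FakeSplit] =
    paths.map(p => FakeSplit(p, Seq("host-" + (p.hashCode.abs % 3))))

  // ...and any filtering also happens there, so a huge number of splits is
  // processed serially by that single node before the job ever starts.
  def planPartitions(paths: Seq[String]): Seq[FakeSplit] =
    listSplits(paths).filter(_.hosts.nonEmpty)
}
```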

Thanks,
Fei 

On Fri, Dec 30, 2016 at 10:41 PM, Sun Rui <sunrise_win@163.com> wrote:
You can’t call runJob inside getPreferredLocations(). getPreferredLocations() is invoked by the DAGScheduler while it is scheduling a job, so calling runJob from there submits a new job and blocks waiting on that same scheduler, which is why it hangs.
You can take a look at the source code of HadoopRDD to help you implement getPreferredLocations() appropriately.
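A minimal sketch of the pattern HadoopRDD uses (plain Scala with stand-in types Split, SplitPartition, and HadoopLikeRdd, not the real Spark classes): each partition wraps its split at construction time, so preferred locations can be answered from the split alone, without running any job:

```scala
// Stand-ins for Hadoop's InputSplit and Spark's Partition (illustrative names).
final case class Split(id: Int, locations: Seq[String])
final class SplitPartition(val index: Int, val split: Split)

class HadoopLikeRdd(splits: Seq[Split]) {
  // Partitions are built once, on the driver, from the already-known splits.
  private val parts: Array[SplitPartition] =
    splits.zipWithIndex.map { case (s, i) => new SplitPartition(i, s) }.toArray

  def partitions: Array[SplitPartition] = parts

  // Mirrors the idea of HadoopRDD.getPreferredLocations: just read the hosts
  // recorded in the split; no job is run, so there is nothing to block on.
  def preferredLocations(p: SplitPartition): Seq[String] = p.split.locations
}
```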

On Dec 31, 2016, at 09:48, Fei Hu <hufei68@gmail.com> wrote:

That is a good idea.

I tried adding the following code to the getPreferredLocations() function:

val results: Array[Array[DataChunkPartition]] = context.runJob(
  partitionsRDD,
  (context: TaskContext, partIter: Iterator[DataChunkPartition]) => partIter.toArray,
  dd,
  allowLocal = true)

But the job seems to hang when this code is executed inside that function. If I move the code elsewhere, such as into the main() function, it runs well.

What is the reason for it?

Thanks,
Fei

On Fri, Dec 30, 2016 at 2:38 AM, Sun Rui <sunrise_win@163.com> wrote:
Maybe you can create your own subclass of RDD and override getPreferredLocations() to implement the logic of dynamically changing the locations.
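A sketch of that idea in plain Scala (Part and RelocatableRdd are illustrative stand-ins; in real code you would extend org.apache.spark.rdd.RDD and override getPreferredLocations(split: Partition)): keep a driver-side map from partition index to hosts, and let the driver update it after construction:

```scala
final class Part(val index: Int)

// Illustrative sketch, not a real Spark RDD: preferred locations come from a
// mutable driver-side map, so they can change after the RDD is constructed.
class RelocatableRdd(numParts: Int, initial: Map[Int, Seq[String]]) {
  private val locs = scala.collection.mutable.Map(initial.toSeq: _*)

  val partitions: Array[Part] = Array.tabulate(numParts)(i => new Part(i))

  // In a real subclass this would be: override def getPreferredLocations(...)
  def preferredLocations(p: Part): Seq[String] = locs.getOrElse(p.index, Nil)

  // Called on the driver to move a partition's preferred hosts.
  def relocate(index: Int, hosts: Seq[String]): Unit = locs(index) = hosts
}
```

Note that preferred locations are consulted when tasks are scheduled, so an update like this only affects tasks scheduled afterwards, and they remain preferences rather than hard constraints.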
> On Dec 30, 2016, at 12:06, Fei Hu <hufei68@gmail.com> wrote:
>
> Dear all,
>
> Is there any way to change the host location for a certain partition of RDD?
>
> "protected def getPreferredLocations(split: Partition)" can be used to initialize the location, but how to change it after the initialization?
>
>
> Thanks,
> Fei
>
>