spark-user mailing list archives

From Jim Carroll <>
Subject Cluster Aware Custom RDD
Date Fri, 16 Jan 2015 21:12:28 GMT
Hello all,

I have a custom RDD for fast loading of data from a non-partitioned source.
The partitioning happens in the RDD implementation by pushing data from the
source into queues that are picked up by the currently active partitions on
the worker threads.
This works great multi-threaded on a single host (say, with the master set
to "local[x]"), but I'd like to run it distributed. To do that I need to know
not only which "slice" my partition is, but also which host (by sequence)
it's on, so I can divide up the source by worker (host) and then run the
load multi-threaded within each worker. In other words, I need what
effectively amounts to a 2-tier slice identifier.

I know this is probably unorthodox, but is there some way to get this
information in the compute method or the deserialized Partition objects?
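To make the gap concrete, here is a rough sketch of the kind of custom RDD I mean. It assumes the Spark 1.x RDD API; `SourceRDD`, `SourcePartition`, and the returned strings are purely illustrative. `SparkEnv.get.executorId` does give an identifier for the executor inside compute, but it is an opaque string rather than the stable 0..N-1 host sequence I'm after:

```scala
import org.apache.spark.{Partition, SparkContext, SparkEnv, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition carrying only the flat slice index.
class SourcePartition(val index: Int) extends Partition

// Sketch: a custom RDD whose compute needs both the partition index
// and an identifier for the worker (host) it landed on.
class SourceRDD(sc: SparkContext, numSlices: Int)
    extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    (0 until numSlices).map(new SourcePartition(_)).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    // The flat slice id is easy to get:
    val slice = split.index
    // SparkEnv.get.executorId identifies the executor running this task,
    // but it is an opaque string, not a per-host sequence number -- which
    // is exactly the 2-tier identifier described above.
    val executor = SparkEnv.get.executorId
    Iterator(s"slice=$slice executor=$executor")
  }
}
```

Running this on a cluster would show each slice paired with an executor id, but there is still no obvious way to turn those ids into a dense host sequence for dividing up the source.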

