Thank you all. Just changing the RDD to a Map structure saved me approx. 1 second.

Yes, I will check out IndexedRDD to see if it has better performance.


If your dataset is large, there is a Spark Package called IndexedRDD optimized for lookups. Feel free to check that out.


Hi Shahab - if your data structures are small enough, a broadcast Map is going to provide faster lookups. A lookup within an RDD is an O(m) operation, where m is the size of the partition. For RDDs with multiple partitions, executors can operate on them in parallel, so you get some improvement for larger RDDs.
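
To make the comparison concrete, here is a minimal sketch of both approaches. The data, the app name, the local master, and the value names (pairs, lookupTable, keys) are made up for illustration and are not from your job:

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Brings the pair-RDD implicits into scope on older Spark releases;
// newer releases pick them up automatically.
import org.apache.spark.SparkContext._

object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for illustration only.
    val conf = new SparkConf().setAppName("broadcast-lookup-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical small pair RDD standing in for the cached RDD[(Int, String)].
    val pairs = sc.parallelize(Seq(1 -> "one", 2 -> "two", 3 -> "three")).cache()

    // RDD lookup: every call is an action that launches a Spark job
    // and scans partition contents for the key.
    val viaRdd: Seq[String] = pairs.lookup(2)

    // Collect the small dataset to the driver once, then broadcast a
    // read-only copy of the Map to the executors.
    val lookupTable = sc.broadcast(pairs.collectAsMap())

    // On the driver this is a plain in-memory hash lookup, with no job launched.
    val viaMap: Option[String] = lookupTable.value.get(2)

    // Inside tasks, read the broadcast value instead of calling lookup on an RDD.
    val keys = sc.parallelize(Seq(3, 1, 4))
    val resolved = keys.map(k => k -> lookupTable.value.getOrElse(k, "unknown")).collect()

    println(s"rdd lookup: $viaRdd, map lookup: $viaMap")
    resolved.foreach(println)

    sc.stop()
  }
}
```

This only makes sense while the whole dataset fits comfortably in memory on the driver and on each executor, which should hold for the 3-30 MB you mention.
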

I am doing lookups on cached RDDs of type RDD[(Int, String)], and I noticed that lookup is relatively slow: 30-100 ms. I even tried this on one machine with a single partition, but it made no difference!

The RDDs are not large at all, 3-30 MB.

Is this expected behaviour? Should I use another data structure, such as a HashMap, to hold the data and look it up there, and use Broadcast to send a copy to all machines?