Thank you all. Just changing the RDD to a Map structure saved me approx. 1 second.

Yes, I will check out IndexedRDD to see if it has better performance.


On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz wrote:

If your dataset is large, there is a Spark Package called IndexedRDD optimized for lookups. Feel free to check that out.


On Feb 19, 2015 7:37 AM, "Ilya Ganelin" wrote:
Hi Shahab - if your data structures are small enough, a broadcast Map is going to provide faster lookups. A lookup within an RDD is an O(m) operation, where m is the size of the partition. For RDDs with multiple partitions, executors can operate on them in parallel, so you get some improvement for larger RDDs.
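
To make the comparison concrete, here is a minimal sketch of both approaches: a lookup on a cached pair RDD, and the same lookup on a Map collected to the driver and broadcast to the executors. The object name, the `compare` helper, the sample data, and the `local[2]` master are all made up for illustration; only `lookup`, `collectAsMap`, and `broadcast` are Spark calls. It assumes the data fits comfortably in memory on the driver and on each executor, as the 3-30 MB RDDs in this thread would.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {

  // Hypothetical helper: runs the same lookup three ways and returns the results.
  def compare(sc: SparkContext): (Seq[String], Option[String], Seq[String]) = {
    // Made-up sample data standing in for a small cached RDD[(Int, String)].
    val pairs = sc.parallelize((1 to 100000).map(i => (i, s"value-$i"))).cache()

    // RDD lookup: each call runs a Spark job that scans partition data.
    val viaRdd: Seq[String] = pairs.lookup(42)

    // Collect once to the driver, then ship one read-only copy to each executor.
    val table = sc.broadcast(pairs.collectAsMap())

    // Driver-side lookup: a plain in-memory map lookup, no job is scheduled.
    val viaMap: Option[String] = table.value.get(42)

    // Executor-side lookup: each task reads its local copy of the broadcast Map.
    val keys = sc.parallelize(Seq(1, 42, 200000))
    val resolved = keys.map(k => table.value.getOrElse(k, "missing")).collect().toSeq

    (viaRdd, viaMap, resolved)
  }

  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for a quick local run.
    val conf = new SparkConf().setAppName("broadcast-lookup-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (viaRdd, viaMap, resolved) = compare(sc)
      println(s"rdd=$viaRdd map=$viaMap resolved=${resolved.mkString(",")}")
    } finally {
      sc.stop()
    }
  }
}
```

Because every `lookup` call goes through job scheduling, part of the per-lookup latency is likely fixed overhead rather than scan time, which would be consistent with seeing no difference on a single machine with a single partition. The Map version pays the collection and broadcast cost once up front.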

On Thu, Feb 19, 2015 at 7:31 AM shahab wrote:

I am doing lookups on cached RDD[(Int, String)]s, and I noticed that each lookup is relatively slow, 30-100 ms. I even tried this on one machine with a single partition, but it made no difference!

The RDDs are not large at all, 3-30 MB.

Is this expected behaviour? Should I use another data structure, like a HashMap, to keep the data and look it up there, and use Broadcast to send a copy to all machines?