If your dataset is large, there is a Spark Package called IndexedRDD optimized for lookups. Feel free to check that out.
Burak

On Feb 19, 2015 7:37 AM, "Ilya Ganelin" <firstname.lastname@example.org> wrote:

Hi Shahab - if your data structures are small enough, a broadcasted Map is going to provide faster lookup. Lookup within an RDD is an O(m) operation, where m is the size of the partition. For RDDs with multiple partitions, executors can operate on them in parallel, so you get some improvement for larger RDDs.

On Thu, Feb 19, 2015 at 7:31 AM shahab <email@example.com> wrote:

Hi,

I am doing lookups on cached RDDs [(Int,String)], and I noticed that the lookup is relatively slow: 30-100 ms. I even tried this on one machine with a single partition, but it made no difference!

The RDDs are not large at all, 3-30 MB.

Is this expected behaviour? Should I use other data structures, like a HashMap, to keep the data and look it up there, and use Broadcast to send a copy to all machines?

best,
/Shahab
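
To make the scan-versus-map trade-off discussed in this thread concrete, here is a minimal plain-Python sketch. It does not use Spark: the list `partition` is a hypothetical stand-in for one cached partition of (Int, String) pairs, and `scan_lookup` and `build_map` are illustrative names, not Spark APIs. In a real job, the map would presumably be built on the driver (for example with `collectAsMap()`) and shipped to executors with `SparkContext.broadcast`.

```python
# Plain-Python illustration only; no Spark is involved.
# `partition` is a hypothetical stand-in for one cached RDD partition
# holding (Int, String) pairs.

def scan_lookup(partition, key):
    """Linear scan: inspect every pair and collect all matching values. O(m)."""
    return [value for k, value in partition if k == key]

def build_map(partition):
    """Build a hash map once. If a key repeats, only its last value is kept."""
    return dict(partition)

partition = [(i, "value-%d" % i) for i in range(100000)]
lookup_map = build_map(partition)

scanned = scan_lookup(partition, 99999)  # walks all 100,000 pairs
hashed = lookup_map.get(99999)           # one hash probe, O(1) on average
```

The scan returns a list of every matching value, while the map returns a single value or `None`, so the map approach only fits when keys are unique or when keeping one value per key is acceptable.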