spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Why is RDD lookup slow?
Date Thu, 19 Feb 2015 15:33:52 GMT
RDDs are not Maps. lookup() does a linear scan -- parallel by
partition, but still linear. Yes, it is not supposed to be an O(1) lookup
data structure. It'd be much nicer to broadcast the relatively small
data set as a Map and look it up fast, locally.
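
To make the cost difference concrete, here is a minimal plain-Python sketch, not Spark code: it models a pair RDD as a list of partitions and contrasts a scan-based lookup with a local dict. The names `partitions`, `rdd_style_lookup` and `local_map` are made up for illustration, and the scan is only a model of the behaviour described above, not Spark's actual implementation.

```python
# Plain-Python model of the two lookup strategies; no Spark required.
# All names here are hypothetical and exist only for this illustration.

# A pair "RDD" modeled as a list of partitions, each a list of (key, value).
partitions = [
    [(1, "a"), (2, "b")],
    [(3, "c"), (1, "d")],
]


def rdd_style_lookup(parts, key):
    """Scan every record in every partition and collect matching values.

    The work is O(n) in the number of records, however it is split up.
    """
    return [v for part in parts for (k, v) in part if k == key]


# Building the dict costs O(n) once; each later lookup is O(1) on average.
# A dict keeps one value per key, so a later duplicate overwrites an earlier one.
local_map = {k: v for part in partitions for (k, v) in part}

print(rdd_style_lookup(partitions, 1))  # ['a', 'd']
print(local_map.get(3))                 # c
```

In PySpark the dict would presumably come from rdd.collectAsMap() on the driver and be shipped to the executors once with sc.broadcast(...), then read through the broadcast's .value. That only makes sense while the data fits comfortably in memory on every machine, which 3-30 MB should.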

On Thu, Feb 19, 2015 at 3:29 PM, shahab <shahab.mokari@gmail.com> wrote:
> Hi,
>
> I am doing lookup on cached RDDs [(Int,String)], and I noticed that the
> lookup is relatively slow: 30-100 ms. I even tried this on one machine with
> a single partition, but it made no difference!
>
> The RDDs are not large at all, 3-30 MB.
>
> Is this expected behaviour? Should I use another data structure, like a HashMap,
> to keep the data and look it up there, and use Broadcast to send a copy to all
> machines?
>
> best,
> /Shahab
>
>


