spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: How to create RDD over hashmap?
Date Fri, 24 Jan 2014 21:36:21 GMT
By my reading of the code, it uses the partitioner to decide which worker
the key lands on, then does an O(N) scan of that partition.  I think we're
saying the same thing.

https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L549
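[Archive editor's note: the strategy described above can be mimicked in plain Scala, with no Spark required. All names below (`numPartitions`, `partitions`, `lookup`) are hypothetical illustrations, not Spark internals: the key is hashed to select one partition, and only that partition is scanned linearly.]

```scala
// Sketch of lookup's strategy, assuming a simple hash partitioner:
// route the key to a single partition, then scan that partition linearly.
val numPartitions = 4
val data = Seq("a" -> 1.0, "b" -> 2.0, "c" -> 3.0)

// Pre-partition the pairs by hashing each key, as a partitioner would.
val partitions: Map[Int, Seq[(String, Double)]] =
  data.groupBy { case (k, _) => math.abs(k.hashCode) % numPartitions }

// Only the partition the key hashes to is scanned: O(partition size).
def lookup(key: String): Seq[Double] = {
  val pid = math.abs(key.hashCode) % numPartitions
  partitions.getOrElse(pid, Seq.empty).collect { case (`key`, v) => v }
}
```

Without a partitioner, every partition would have to be scanned, which is why Cheng's O(N) bound below tightens to the average partition size when a partitioner is present.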


On Fri, Jan 24, 2014 at 1:26 PM, Cheng Lian <rhythm.mail@gmail.com> wrote:

> PairRDDFunctions.lookup is good enough in Spark, it's just that its time
> complexity is O(N).  Of course, for RDDs equipped with a partitioner, N is
> the average size of a partition.
>
>
> On Sat, Jan 25, 2014 at 5:16 AM, Andrew Ash <andrew@andrewash.com> wrote:
>
>> If you have a pair RDD (an RDD[A,B]) then you can use the .lookup()
>> method on it for faster access.
>>
>>
>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions
>>
>> Spark's strength is running computations across a large set of data.  If
>> you're trying to do fast lookup of a few individual keys, I'd recommend
>> something more like memcached or Elasticsearch.
>>
>>
>> On Fri, Jan 24, 2014 at 1:11 PM, Manoj Samel <manojsameltech@gmail.com> wrote:
>>
>>> Yes, that works.
>>>
>>> But then the hashmap's fast key lookup is gone and the search becomes a
>>> linear scan over an iterator. Not sure if Spark internally adds
>>> optimizations for a Seq, but otherwise one has to assume this becomes a
>>> List/Array without the fast key lookup of a hashmap or b-tree.
>>>
>>> Any thoughts?
>>>
>>> On Fri, Jan 24, 2014 at 1:00 PM, Frank Austin Nothaft <
>>> fnothaft@berkeley.edu> wrote:
>>>
>>>> Manoj,
>>>>
>>>> I assume you’re trying to create an RDD[(String, Double)]? Couldn’t you
>>>> just do:
>>>>
>>>> val cr_rdd = sc.parallelize(cr.toSeq)
>>>>
>>>> The toSeq would convert the HashMap[String,Double] into a Seq[(String,
>>>> Double)] before calling the parallelize function.
>>>>
>>>> Regards,
>>>>
>>>> Frank Austin Nothaft
>>>> fnothaft@berkeley.edu
>>>> fnothaft@eecs.berkeley.edu
>>>> 202-340-0466
>>>>
>>>> On Jan 24, 2014, at 12:56 PM, Manoj Samel <manojsameltech@gmail.com>
>>>> wrote:
>>>>
>>>> > Is there a way to create RDD over a hashmap ?
>>>> >
>>>> > If I have a hash map and try sc.parallelize, it gives
>>>> >
>>>> > <console>:17: error: type mismatch;
>>>> >  found   : scala.collection.mutable.HashMap[String,Double]
>>>> >  required: Seq[?]
>>>> > Error occurred in an application involving default arguments.
>>>> >        val cr_rdd = sc.parallelize(cr)
>>>> >                                    ^
>>>>
>>>>
>>>
>>
>
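[Archive editor's note: the type mismatch at the bottom of the thread arises because `sc.parallelize` expects a `Seq`, and Frank's `.toSeq` supplies the `Seq[(String, Double)]` it wants. A minimal plain-Scala sketch of just the conversion step, no SparkContext needed; `cr` here is a small stand-in for the map in the error message:]

```scala
import scala.collection.mutable

// Stand-in for the HashMap[String, Double] that triggered the error.
val cr = mutable.HashMap("a" -> 1.0, "b" -> 2.0)

// toSeq yields the Seq[(String, Double)] of key/value pairs that
// sc.parallelize(cr.toSeq) would accept.
val pairs: Seq[(String, Double)] = cr.toSeq
```

Note that the resulting pair RDD has no hashmap semantics, which is exactly the concern Manoj raises above: key access goes through `lookup`, not an O(1) hash probe.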
