How would I create a PairRDD?

On Fri, Jan 24, 2014 at 1:54 PM, Tathagata Das wrote:
On this note, you can do something smarter than the basic lookup function. You could convert each partition of the key-value pair RDD into a hashmap using something like

import scala.collection.mutable.{ArrayBuffer, HashMap}

val rddOfHashmaps = pairRDD.mapPartitions(iterator => {
  val hashmap = new HashMap[String, ArrayBuffer[Double]]
  iterator.foreach { case (key, value) =>
    hashmap.getOrElseUpdate(key, new ArrayBuffer[Double]) += value
  }
  Iterator(hashmap)
}, preservesPartitioning = true)

And then you can do a variation of the lookup function to look up the right partition, and then within that partition directly look up the key in the hashmap and return the value (rather than scanning the whole partition). That gives practically O(1) lookup time instead of O(N). But I doubt it will match something that a dedicated lookup system like memcached would achieve.
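A rough sketch of what that lookup variation could look like. The `fastLookup` helper below is hypothetical (not part of the Spark API), and it assumes `rddOfHashmaps` was built as above from a pair RDD partitioned by `partitioner`:

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical helper: run a job against only the partition the
// partitioner maps the key to, and probe that partition's hashmap directly.
def fastLookup(
    rddOfHashmaps: RDD[HashMap[String, ArrayBuffer[Double]]],
    partitioner: Partitioner,
    key: String): Seq[Double] = {
  val targetPartition = partitioner.getPartition(key)
  val probe = (iter: Iterator[HashMap[String, ArrayBuffer[Double]]]) =>
    iter.flatMap(_.get(key)).flatten.toSeq
  // runJob touches only the one partition, so the per-lookup cost is a
  // hashmap probe rather than a scan over the whole partition.
  rddOfHashmaps.context.runJob(rddOfHashmaps, probe, Seq(targetPartition)).head
}
```

This mirrors what PairRDDFunctions.lookup does internally, except the final step is a hashmap probe instead of a linear scan of the partition's iterator.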

TD

On Fri, Jan 24, 2014 at 1:36 PM, Andrew Ash <andrew@andrewash.com> wrote:
By my reading of the code, it uses the partitioner to decide which worker the key lands on, then does an O(N) scan of that partition. I think we're saying the same thing.

On Fri, Jan 24, 2014 at 1:26 PM, Cheng Lian wrote:
PairRDDFunctions.lookup is good enough in Spark, it's just that its time complexity is O(N). Of course, for RDDs equipped with a partitioner, N is the average size of a partition.

On Sat, Jan 25, 2014 at 5:16 AM, Andrew Ash wrote:

On Fri, Jan 24, 2014 at 1:11 PM, Manoj Samel <manojsameltech@gmail.com> wrote:
Yes, that works.

But then the hashmap functionality of the fast key lookup etc. is gone, and the search will be linear using an iterator. Not sure if Spark internally creates additional optimizations for Seq, but otherwise one has to assume this becomes a List/Array without the fast key lookup of a hashmap or b-tree.

Any thoughts?


On Fri, Jan 24, 2014 at 1:00 PM, Frank Austin Nothaft wrote:
Manoj,

I assume you're trying to create an RDD[(String, Double)]? Couldn't you just do:

val cr_rdd = sc.parallelize(cr.toSeq)

The toSeq would convert the HashMap[String, Double] into a Seq[(String, Double)] before calling the parallelize function.
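For completeness, a minimal sketch of the round trip (assuming a live SparkContext named sc; collectAsMap comes from PairRDDFunctions and gathers the pairs back to the driver):

```scala
import scala.collection.mutable.HashMap

val cr = HashMap("a" -> 1.0, "b" -> 2.0)

// toSeq turns the map into Seq[(String, Double)], which parallelize accepts;
// because the element type is a tuple, the result is a pair RDD
val cr_rdd = sc.parallelize(cr.toSeq)

// pair-RDD operations are now available, e.g. pulling the data back as a map
val back = cr_rdd.collectAsMap()
```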

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Jan 24, 2014, at 12:56 PM, Manoj Samel <manojsameltech@gmail.com> wrote:
> Is there a way to create RDD over a hashmap ?
>
> If I have a hash map and try sc.parallelize, it gives
>
> <console>:17: error: type mismatch;
>  found   : scala.collection.mutable.HashMap[String,Double]
>  required: Seq[?]
> Error occurred in an application involving default arguments.
>        val cr_rdd = sc.parallelize(cr)
>                                    ^
