spark-user mailing list archives

From Biplob Biswas <revolutioni...@gmail.com>
Subject Re: Replacing groupBykey() with reduceByKey()
Date Mon, 06 Aug 2018 08:20:45 GMT
Hi Santhosh,

If you are not performing any aggregation, then I don't think you can
replace your groupByKey() with a reduceByKey(). As I see it, you are only
grouping and taking 2 values of the result, so I don't believe you can
simply swap one for the other.
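To illustrate the distinction: reduceByKey() needs an associative function
that merges two values into one, while groupByKey() just collects all values
per key. Here is a minimal plain-Python sketch of the two semantics (the
helper names and sample keys/values are made up for illustration; in real
PySpark the closest trick would be something like
`rdd.map(lambda x: (x[0], [x])).reduceByKey(lambda a, b: a + b)`, which buys
you nothing because nothing is actually reduced):

```python
# Plain-Python sketch of the semantic difference (no Spark needed).
# Keys and values below are made up for illustration.

def group_by_key(pairs):
    """Mimic groupByKey: collect every value per key into a list."""
    out = {}
    for k, v in pairs:
        out.setdefault(k, []).append(v)
    return out

def reduce_by_key(pairs, fn):
    """Mimic reduceByKey: fold values per key with an associative fn."""
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return out

pairs = [("KEY_1", "e7dc1f2a"), ("KEY_2", "fb0bc953"), ("KEY_1", "d39714d3")]

grouped = group_by_key(pairs)
# reduceByKey can only reproduce the grouping if you first wrap each
# value in a one-element list and concatenate -- which gains nothing
# over groupByKey, since no per-key combining (aggregation) happens.
reduced = reduce_by_key([(k, [v]) for k, v in pairs], lambda a, b: a + b)

print(grouped == reduced)  # the two produce the same grouping
```

The point being: reduceByKey() only pays off when the merge function shrinks
the data (sum, max, count, ...) before the shuffle; wrapping values in lists
just to concatenate them is groupByKey() in disguise.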

Thanks & Regards
Biplob Biswas


On Sat, Aug 4, 2018 at 12:05 AM Bathi CCDB <bathi.ccdb@gmail.com> wrote:

> I am trying to replace groupByKey() with reduceByKey(). I am a pyspark
> and python newbie and I am having a hard time figuring out the lambda
> function for the reduceByKey() operation.
>
> Here is the code
>
> dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0],x)).groupByKey(25).take(2)
>
> Here is the return value
>
> >>> dd
> [(u'KEY_1', <pyspark.resultiterable.ResultIterable object at 0x107be0c50>),
>  (u'KEY_2', <pyspark.resultiterable.ResultIterable object at 0x107be0c10>)]
>
> and Here are the iterable contents dd[0][1]
>
> Row(key=u'KEY_1', hash_fn=u'deec95d65ca6b3b4f2e1ef259040aa79', value=u'e7dc1f2a')
> Row(key=u'KEY_1', hash_fn=u'f8891048a9ef8331227b4af080ecd28a', value=u'fb0bc953')
> ...
> Row(key=u'KEY_1', hash_fn=u'1b9d2bb2db28603ff21052efcd13f242', value=u'd39714d3')
> Row(key=u'KEY_1', hash_fn=u'c41b0269706ac423732a6bab24bf8a6a', value=u'ab58db92')
>
> My question is: how do I replace groupByKey() with reduceByKey() and get
> the same output as above?
>
> Santhosh
>
