spark-user mailing list archives

From Javier Rey <jre...@gmail.com>
Subject Re: Sum array values by row in new column
Date Tue, 16 Aug 2016 19:22:08 GMT
Hi, thanks!

This works, but I also need the mean :) I am looking for a way to do that.

Regards.
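
One possible way to get the mean, following the same UDF pattern as the reply quoted below, is sketched here. The names `array_sum`, `array_mean`, `with_sum_and_mean`, and the output columns `total` and `mean` are made up for illustration and are not from the thread. The helpers return Python floats because a UDF declared as `DoubleType` yields null when handed a Python int.

```python
def array_sum(values):
    """Sum of a list of numbers as a float; None for a null array."""
    if values is None:
        return None
    return float(sum(values))


def array_mean(values):
    """Arithmetic mean of a list of numbers; None for a null or empty array."""
    if not values:
        return None
    return float(sum(values)) / len(values)


def with_sum_and_mean(df, column="numbers"):
    """Return df with hypothetical 'total' and 'mean' columns appended."""
    # Imported here so the plain-Python helpers above can be tried
    # without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    sum_udf = udf(array_sum, DoubleType())
    mean_udf = udf(array_mean, DoubleType())
    return (df.withColumn("total", sum_udf(df[column]))
              .withColumn("mean", mean_udf(df[column])))
```

With a DataFrame that has an array column named `numbers`, `with_sum_and_mean(df).show()` should display the two extra columns. Rows with different array lengths are handled because each array is averaged over its own length.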

2016-08-16 5:30 GMT-05:00 ayan guha <guha.ayan@gmail.com>:

> Here is a more generic way of doing this:
>
> from pyspark.sql import Row
> df = sc.parallelize([[1,2,3,4],[10,20,30]]).map(lambda x: Row(numbers=x)).toDF()
> df.show()
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> u = udf(lambda c: sum(c), IntegerType())
> df1 = df.withColumn("s",u(df.numbers))
> df1.show()
>
> On Tue, Aug 16, 2016 at 11:50 AM, Mike Metzger <mike@flexiblecreations.com> wrote:
>
>> Assuming you know the number of elements in the list, this should work:
>>
>> df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) +
>> df["_1"].getItem(2))
>>
>> Mike
>>
>> On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey <jreyro@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a dataframe with a single column, and this column is an array of
>>> numbers. How can I sum each array by row and obtain a new column with the
>>> sum, in PySpark?
>>>
>>> Example:
>>>
>>> +------------+
>>> |     numbers|
>>> +------------+
>>> |[10, 20, 30]|
>>> |[40, 50, 60]|
>>> |[70, 80, 90]|
>>> +------------+
>>>
>>> The idea is to obtain the same df with a new column holding the totals:
>>>
>>> +------------+-----+
>>> |     numbers|     |
>>> +------------+-----+
>>> |[10, 20, 30]|   60|
>>> |[40, 50, 60]|  150|
>>> |[70, 80, 90]|  240|
>>> +------------+-----+
>>>
>>> Regards!
>>>
>>> Samir
>>>
>>>
>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
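
For the fixed-length approach quoted above (adding `getItem(0)`, `getItem(1)`, ... by hand), the expression can also be built in a loop when the number of elements is known. This is only a sketch: `sum_first_n` is a hypothetical helper name, and it assumes `n` is at least 1 and that every array has at least `n` elements.

```python
from functools import reduce
from operator import add


def sum_first_n(column, n):
    """Build column[0] + column[1] + ... + column[n-1] as one expression.

    `column` is expected to be a pyspark.sql.Column holding an array;
    n must be >= 1, otherwise reduce() raises TypeError on the empty sequence.
    """
    return reduce(add, (column.getItem(i) for i in range(n)))
```

A call such as `df.withColumn("total", sum_first_n(df["numbers"], 3))` would then stand in for the hand-written three-term sum. Unlike the UDF version, this stays a native column expression, but it does not adapt to rows whose arrays have different lengths.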
