spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: Sum array values by row in new column
Date Tue, 16 Aug 2016 10:30:44 GMT
Here is a more generic way of doing this:

from pyspark.sql import Row
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Wrap each list in a Row so toDF() infers an array column named "numbers"
df = sc.parallelize([[1, 2, 3, 4], [10, 20, 30]]).map(lambda x: Row(numbers=x)).toDF()
df.show()

# A UDF that sums the array in each row, whatever its length
u = udf(lambda c: sum(c), IntegerType())
df1 = df.withColumn("s", u(df.numbers))
df1.show()
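One caveat worth noting (an addition, not from the original thread): sum() raises a TypeError if the array column itself is null, so a null-guarded lambda is safer before wrapping it with udf(..., IntegerType()) as above. A minimal sketch of the guard, shown in plain Python:

```python
# Null-safe sum: return None when the whole array is null instead of
# letting sum(None) raise a TypeError inside the UDF
safe_sum = lambda c: sum(c) if c is not None else None

safe_sum([10, 20, 30])  # 60
safe_sum(None)          # None
```

You would then pass safe_sum to udf() in place of the bare lambda.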

On Tue, Aug 16, 2016 at 11:50 AM, Mike Metzger <mike@flexiblecreations.com>
wrote:

> Assuming you know the number of elements in the list, this should work:
>
> df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) +
> df["_1"].getItem(2))
>
> Mike
>
> On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey <jreyro@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I have a dataframe with one column, and this column is an array of numbers.
>> How can I sum each array by row and obtain a new column with the totals, in PySpark?
>>
>> Example:
>>
>> +------------+
>> |     numbers|
>> +------------+
>> |[10, 20, 30]|
>> |[40, 50, 60]|
>> |[70, 80, 90]|
>> +------------+
>>
>> The idea is to obtain the same df with a new column containing the totals:
>>
>> +------------+-----+
>> |     numbers|total|
>> +------------+-----+
>> |[10, 20, 30]|   60|
>> |[40, 50, 60]|  150|
>> |[70, 80, 90]|  240|
>> +------------+-----+
>>
>> Regards!
>>
>> Samir
>>
>>
>>
>>
>


-- 
Best Regards,
Ayan Guha
