spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Add column sum as new column in PySpark dataframe
Date Thu, 04 Aug 2016 13:47:12 GMT
Sorry, do you want the sum for each row or the sum for each column?

Assuming all columns are numeric.
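
For the row-wise case, here is a minimal sketch rather than a definitive answer. The AttributeError quoted below is commonly seen when the builtin sum has been shadowed by pyspark.sql.functions.sum (for example after "from pyspark.sql.functions import *"), so this version avoids sum altogether and folds the columns with functools.reduce. The helper names sum_expr, add_total and column_totals are made up for illustration, and treating NULL as 0 via coalesce is an assumption about what "sum only non-null values" should mean.

```python
from functools import reduce
from operator import add


def sum_expr(terms):
    """Fold a non-empty list of terms into a single a + b + c + ... expression."""
    return reduce(add, terms)


def add_total(df, cols, name="total"):
    """Return df with an extra column holding the row-wise sum of cols.

    NULL values are replaced by 0 before adding (an assumption), so one
    NULL does not turn the whole row total into NULL.
    """
    # Local import keeps the pyspark dependency confined to this function.
    from pyspark.sql import functions as F

    terms = [F.coalesce(F.col(c), F.lit(0)) for c in cols]
    return df.withColumn(name, sum_expr(terms))


def column_totals(df, cols):
    """Return a one-row DataFrame with the sum of each column in cols.

    The sum aggregate skips NULL values on its own.
    """
    from pyspark.sql import functions as F

    return df.agg(*[F.sum(c).alias(c) for c in cols])
```

With the dataframe from the question, add_total(df, ['a', 'b', 'c']) or add_total(df, df.columns) would give the per-row total, and column_totals(df, df.columns) the per-column totals.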

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 August 2016 at 14:41, Javier Rey <jreyro@gmail.com> wrote:

> Hi everybody,
>
> Sorry, my last message was incomplete. This is the complete one:
>
> I'm using PySpark and I have a Spark dataframe with a bunch of numeric
> columns. I want to add a column that is the sum of all the other columns.
>
> Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
>
> df.withColumn('total_col', df.a + df.b + df.c)
>
> The problem is that I don't want to type out each column individually and
> add them, especially if I have a lot of columns. I want to be able to do
> this automatically or by specifying a list of column names that I want to
> add. Is there another way to do this?
>
> I found this solution:
>
> df.withColumn('total', sum(df[col] for col in df.columns))
>
> But I get this error:
>
> "AttributeError: 'generator' object has no attribute '_get_object_id'"
>
> Additionally, I want to sum only the non-null values.
>
> Thanks in advance,
>
> Samir
>
