spark-user mailing list archives

From Javier Rey <jre...@gmail.com>
Subject Add column sum as new column in PySpark dataframe
Date Thu, 04 Aug 2016 13:41:29 GMT
Hi everybody,

Sorry, my last message was incomplete; this is the complete one:

I'm using PySpark and I have a Spark dataframe with a bunch of numeric
columns. I want to add a column that is the sum of all the other columns.

Suppose my dataframe had columns "a", "b", and "c". I know I can do this:

df.withColumn('total_col', df.a + df.b + df.c)

The problem is that I don't want to type out each column individually and
add them, especially if I have a lot of columns. I want to be able to do
this automatically or by specifying a list of column names that I want to
add. Is there another way to do this?

I found this solution:

df.withColumn('total', sum(df[col] for col in df.columns))

But I get this error:

"AttributeError: 'generator' object has no attribute '_get_object_id'"

Additionally, I want to sum only the non-null values.

Thanks in advance,

Samir
