spark-user mailing list archives

From Andrew Vykhodtsev <yoz...@gmail.com>
Subject pyspark.GroupedData.agg works incorrectly when one column is aggregated twice?
Date Fri, 27 May 2016 11:28:00 GMT
Dear list,

I am trying to calculate sum and count on the same column:

import pyspark.sql.functions as fn

user_id_books_clicks = (
    sqlContext.read.parquet('hdfs:///projects/kaggle-expedia/input/train.parquet')
              .groupby('user_id')
              .agg({'is_booking': 'count', 'is_booking': 'sum'})
              .orderBy(fn.desc('count(user_id)'))
              .cache()
)

If I do it like that, it only gives me one aggregate (the last one),
sum(is_booking).
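
If I check the dict literal on its own, in plain Python with no Spark
involved, the duplicate key already collapses to a single entry, so I
suspect Spark never even sees the first aggregate:

    >>> {'is_booking': 'count', 'is_booking': 'sum'}
    {'is_booking': 'sum'}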

But if I change it to .agg({'user_id': 'count', 'is_booking': 'sum'}), it
gives me both. I am on 1.6.1. Is this fixed in 2.x, or should I report it
to JIRA?
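
For reference, a minimal sketch of the same aggregation written with
explicit Column expressions instead of a dict, which avoids having two
identical dict keys at all (the 'clicks' and 'bookings' aliases are just
illustrative names):

    import pyspark.sql.functions as fn

    # Each aggregate is its own Column expression, so both survive.
    user_id_books_clicks = (
        sqlContext.read.parquet('hdfs:///projects/kaggle-expedia/input/train.parquet')
                  .groupBy('user_id')
                  .agg(fn.count('is_booking').alias('clicks'),
                       fn.sum('is_booking').alias('bookings'))
                  .orderBy(fn.desc('clicks'))
                  .cache()
    )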
