spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shiva Prashanth Vallabhaneni <shiva.vallabhan...@tavant.com>
Subject RE: Multiple column aggregations
Date Mon, 11 Feb 2019 13:23:09 GMT
Hi Sonu,

You could use a query that is similar to the below one. You could further optimize the below
query by adding a WHERE clause. I would suggest that you benchmark the performance of both
approaches (multiple group-by queries vs single query with multiple window functions), before
choosing one of these options. Before running the benchmark, I would ensure that the underlying
data is stored in a columnar storage format with compression enabled. For instance, you could
use parquet file format with block-level compression using Snappy.

SELECT  SUM(CASE WHEN accountRank =2 THEN 1 ELSE 0 END) AS accountsWithMoreThanOneOrder,
SUM(CASE WHEN orderRank =2 THEN 1 ELSE 0 END) AS ordersWithMoreThanOneAccount,
FROM   (
                  SELECT  accountNo,
                               orderNo,
  rank() OVER (PARTITION BY orderNo ORDER BY accountNo) AS orderRank,
 rank() OVER (PARTITION BY accountNo ORDER BY orderNo) AS accountRank
                  FROM   accountOrders
                )

P.S – You will need to check the above query for any syntax errors.

– Shiva

From: Sonu Jyotshna <sonu.jyotshna@gmail.com>
Sent: Saturday, February 9, 2019 10:17 AM
To: user@spark.apache.org
Subject: Multiple column aggregations


Hello,

I have a requirement where I need to group by multiple columns and aggregate them not at same
time .. I mean I have a structure which contains accountid, some cols, order id . I need to
calculate some scenarios like account having multiple orders so group by account and aggregate
will work here but I need to find orderid associated to multiple accounts so may be group
by orderid will work here but for better performance on the dataset level can we do in single
step? Where both will work or any better approach I can follow . Can you help


Regards,
Sonu
________________________________
Any comments or statements made in this email are not necessarily those of Tavant Technologies.
The information transmitted is intended only for the person or entity to which it is addressed
and may contain confidential and/or privileged material. If you have received this in error,
please contact the sender and delete the material from any computer. All emails sent from
or to Tavant Technologies may be subject to our monitoring procedures.
Mime
View raw message