spark-user mailing list archives

From lsn24
Subject Spark Sql group by less performant
Date Tue, 11 Dec 2018 00:28:20 GMT

I have a requirement where I need to get, for each group, the total count of
rows and the count of failed rows (failedRows).

The code looks like this:


Dataset <Row> countDataset = sparkSession.sql("Select
column1,column2,column3,column4,column5,column6,column7,column8, count(*) as
totalRows, sum(CASE WHEN (column8 is NULL) THEN 1 ELSE 0 END) as failedRows 
from temp_view group by

Up to around 50 million records, the query performance was OK. Beyond that,
the job gave up, mostly failing with an out-of-memory exception.

I read the documentation and blogs; most of them give examples using
RDD.reduceByKey. But here I have a Dataset and Spark SQL.
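
For reference, the shape of the computation is the same in both styles: count(*) and the conditional sum can each be built up as partial results and merged key by key, which is what reduceByKey does explicitly and what Spark SQL's aggregation generally does through partial aggregation before the shuffle. The sketch below is plain Java with no Spark dependency and only illustrates that merge logic. The class name, method names, and the two-field row layout (one grouping key, one nullable value) are assumptions made for illustration, not the actual schema of temp_view.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; all names here are hypothetical.
class GroupCountsSketch {

    // Aggregates one "partition": for each key, {totalRows, failedRows}.
    // A row is {key, value}; a null value counts as a failed row,
    // mirroring sum(CASE WHEN value IS NULL THEN 1 ELSE 0 END).
    static Map<String, long[]> aggregate(List<String[]> rows) {
        Map<String, long[]> acc = new HashMap<>();
        for (String[] row : rows) {
            long[] counts = acc.computeIfAbsent(row[0], k -> new long[2]);
            counts[0] += 1;                       // count(*)
            if (row[1] == null) counts[1] += 1;   // conditional sum
        }
        return acc;
    }

    // Merges two partial results key by key, like the function passed to reduceByKey.
    static Map<String, long[]> merge(Map<String, long[]> left, Map<String, long[]> right) {
        Map<String, long[]> out = new HashMap<>();
        left.forEach((k, v) -> out.put(k, v.clone()));
        right.forEach((k, v) -> out.merge(k, v.clone(),
                (a, b) -> new long[] {a[0] + b[0], a[1] + b[1]}));
        return out;
    }

    public static void main(String[] args) {
        // Two small "partitions" of sample rows.
        List<String[]> p1 = Arrays.asList(new String[] {"a", "x"}, new String[] {"a", null});
        List<String[]> p2 = Arrays.asList(new String[] {"a", null}, new String[] {"b", "y"});
        Map<String, long[]> result = merge(aggregate(p1), aggregate(p2));
        result.forEach((k, v) ->
                System.out.println(k + " totalRows=" + v[0] + " failedRows=" + v[1]));
    }
}
```

Because each key only ever holds two running counters, the per-key state is small; with a real eight-column grouping, memory pressure would depend more on how many distinct key combinations exist than on the aggregate functions themselves.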

What am I missing here?

Any help will be appreciated.

