spark-user mailing list archives

From Spark User <>
Subject Question about single/multi-pass execution in Spark-2.0 dataset/dataframe
Date Tue, 27 Sep 2016 20:02:25 GMT
case class Record(keyAttr: String, attr1: String, attr2: String, attr3: String)

val ds = sparkSession.createDataset(rdd).as[Record]

val attr1Counts = ds.groupBy("keyAttr", "attr1").count()

val attr2Counts = ds.groupBy("keyAttr", "attr2").count()

val attr3Counts = ds.groupBy("keyAttr", "attr3").count()

//similar counts for 20 attributes

//code to merge attr1Counts and attr2Counts and attr3Counts
//translate it to desired output format and save the result.
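To make the elided "merge" step concrete, here is a minimal sketch of the kind of merging I mean, modeled with plain Scala Maps instead of DataFrames (the shapes, names, and sample values are hypothetical): each per-attribute count is a (keyAttr, attrValue) -> count map, tagged with its attribute name and unioned into one combined view.

```scala
// Hypothetical per-attribute count maps standing in for attr1Counts etc.:
// (keyAttr, attrValue) -> count
val attr1Counts = Map(("k1", "a") -> 2L, ("k2", "b") -> 1L)
val attr2Counts = Map(("k1", "x") -> 1L, ("k1", "y") -> 1L)
val attr3Counts = Map(("k1", "p") -> 2L)

// Tag each map with its attribute name and union everything into one
// view keyed by (attrName, keyAttr, attrValue).
val merged: Map[(String, String, String), Long] =
  Seq("attr1" -> attr1Counts, "attr2" -> attr2Counts, "attr3" -> attr3Counts)
    .flatMap { case (name, m) =>
      m.map { case ((key, value), n) => ((name, key, value), n) }
    }.toMap
```

In Spark the same shape could be a union of the three count DataFrames with a literal attribute-name column added, or a chain of outer joins on keyAttr, depending on the desired output format.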

Some more details:
1) The application is a Spark Streaming application with a batch interval on
the order of 5-10 minutes
2) The data set is large, on the order of millions of records per batch
3) I'm using Spark 2.0

The above implementation doesn't seem efficient at all if the data set is
scanned through all its rows once per count aggregation when computing
attr1Counts, attr2Counts, and attr3Counts. I'm concerned about the
performance, so:

1) Does the Catalyst optimizer handle such queries and do a single
pass over the data set under the hood?
2) Is there a better way to do such aggregations, maybe using UDAFs? Or
is it better to use RDD.reduceByKey for this use case?
RDD.reduceByKey performs well for this data and a batch interval of 5-10
minutes. I'm not sure whether the Dataset implementation explained above
will be equivalent or better.
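For what it's worth, here is a sketch of the single-pass shape I have in mind for the reduceByKey route, written with plain Scala collections so it stands alone (the Record fields and sample data are made up): flatten each record into (attrName, keyAttr, attrValue) triples, then do one counting aggregation over the triples instead of one groupBy per attribute.

```scala
// Hypothetical record type mirroring the case class in my snippet.
case class Record(keyAttr: String, attr1: String, attr2: String, attr3: String)

val records = Seq(
  Record("k1", "a", "x", "p"),
  Record("k1", "a", "y", "p"),
  Record("k2", "b", "x", "q")
)

// One pass over the data emits a triple per attribute per record.
val triples = records.flatMap { r =>
  Seq(("attr1", r.keyAttr, r.attr1),
      ("attr2", r.keyAttr, r.attr2),
      ("attr3", r.keyAttr, r.attr3))
}

// The reduceByKey(_ + _) equivalent: count occurrences of each triple.
val counts: Map[(String, String, String), Long] =
  triples.groupBy(identity).map { case (k, v) => (k, v.size.toLong) }
```

On an RDD the same idea would be rdd.flatMap(...).map(t => (t, 1L)).reduceByKey(_ + _), so all 20 attribute counts come out of a single scan and a single shuffle.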

