spark-user mailing list archives

From the3rdNotch <>
Subject How to calculate standard deviation of grouped data in a DataFrame?
Date Mon, 03 Aug 2015 14:30:35 GMT
I have user logs that I took from a CSV and converted into a DataFrame in
order to leverage the SparkSQL querying features.  A single user will
create numerous entries per hour, and I would like to gather some basic
statistical information for each user; really just the count of the user
instances, the average, and the standard deviation of several columns.  I
was able to get the mean and count quickly by using
groupBy($"user") and the aggregator with the SparkSQL functions count and avg:

val meanData = selectedData.groupBy($"user").agg(count($"logOn"),
  avg($"submit"), avg($"submitsPerHour"), avg($"replies"),
  avg($"repliesPerHour"), avg($"duration"))

However, I cannot seem to find an equally elegant way to calculate the
standard deviation.  So far I can only calculate it by mapping to a
(String, Double) pair and using the StatCounter().stdev utility:

val stdevduration = duration.groupByKey().mapValues(values =>
  StatCounter(values).stdev)

This returns an RDD, however, and I would like to keep everything in a
DataFrame so that further queries are possible on the returned data.  Is
there a similarly simple method for calculating the standard deviation,
like there is for the mean and count?
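For what it's worth, one workaround I have been considering (a sketch only,
not something I have verified) is to express the population standard
deviation through avg alone, using the identity
stddev(x) = sqrt(E[x^2] - E[x]^2), so that everything stays inside agg():

```scala
// Sketch, assuming selectedData and the "duration" column from above.
// sqrt and avg are the standard org.apache.spark.sql.functions; the
// output column name "stddev_duration" is just my own choice.
import org.apache.spark.sql.functions.{avg, sqrt}

val stats = selectedData.groupBy($"user").agg(
  sqrt(avg($"duration" * $"duration") - avg($"duration") * avg($"duration"))
    .as("stddev_duration")
)
```

But this gets unwieldy fast with many columns, which is why I am hoping a
built-in exists.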
