spark-user mailing list archives

From Richard Siebeling <>
Subject Best way to calculate intermediate column statistics
Date Wed, 24 Aug 2016 14:42:26 GMT

What is the best way to calculate intermediate column statistics, like the
number of empty values and the number of distinct values for each column in a
dataset, when aggregating or filtering data, alongside the actual result of
the aggregation or the filtered data?

We are developing an application in which the user can slice-and-dice
through the data, and we would like to get, next to the actual resulting
data, column statistics for each column in the resulting dataset. We would
prefer to calculate the column statistics in the same pass over the data as
the actual aggregation or filtering. Is that possible?

We could sacrifice a little performance (but not too much); that's why we
prefer a single pass...

Is this possible in standard Spark, or would it mean modifying the source a
little and recompiling? Is that feasible / wise to do?
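For context, here is one sketch of the kind of single-pass statistics I have in mind, using the standard DataFrame API: build one aggregate expression per column (null count and distinct count) and pass them all to a single `agg` call, so Spark computes them in one aggregation over the filtered data. The input path, the filter condition, and the definition of "empty" as null are all placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("ColumnStats").getOrCreate()

// Hypothetical input and slice; substitute the real data source and filter.
val df = spark.read.parquet("/path/to/data")
val filtered = df.filter(col("amount") > 0)

// One pair of aggregate expressions per column: empty (here: null) count
// and distinct count. All of them are evaluated in the same agg call,
// i.e. in one pass over `filtered`.
val statExprs = filtered.columns.flatMap { c =>
  Seq(
    sum(when(col(c).isNull, 1).otherwise(0)).as(s"${c}_empty"),
    countDistinct(col(c)).as(s"${c}_distinct")
  )
}

val stats = filtered.agg(statExprs.head, statExprs.tail: _*)
stats.show()
```

This still triggers its own aggregation job separate from the user-facing result, which is exactly what I would like to avoid if the statistics can instead piggyback on the pass that produces the actual output.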

thanks in advance,
