spark-user mailing list archives

From K Koh <>
Subject Efficient way to aggregate event data at daily/weekly/monthly level
Date Thu, 03 Apr 2014 00:22:12 GMT

I want to aggregate time-stamped event data, stored in a directory layout of the
form data/yyyy/mm/dd/dat.gz, at the daily, weekly and monthly level. For example:
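(The original listing did not survive the archive; under the stated format the layout would look something like the following, with illustrative dates:)

```
data/2010/01/01/dat.gz
data/2010/01/02/dat.gz
...
data/2010/02/01/dat.gz
```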

Each dat.gz file contains tuples in (datetime, id, value) format. I can
perform aggregation as follows:
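(The original snippet was lost in the archive. A plain-Python sketch of the three-pass approach described below, with hypothetical names and ordinary dicts standing in for Spark's map/reduceByKey, might look like this:)

```python
# Three independent aggregation passes -- each one re-reads all the data.
# Hypothetical sketch: sums `value` per (period, id) key.
from collections import defaultdict
from datetime import datetime

def aggregate(records, key_fn):
    """Group (datetime, id, value) tuples by key_fn(ts) and sum the values."""
    totals = defaultdict(float)
    for ts, id_, value in records:
        totals[(key_fn(ts), id_)] += value
    return dict(totals)

daily_key   = lambda ts: ts.strftime("%Y-%m-%d")
weekly_key  = lambda ts: ts.strftime("%Y-W%W")   # week-of-year label
monthly_key = lambda ts: ts.strftime("%Y-%m")

records = [
    (datetime(2010, 1, 1, 10), "a", 1.0),
    (datetime(2010, 1, 2, 11), "a", 2.0),
    (datetime(2010, 2, 1, 12), "a", 4.0),
]

daily   = aggregate(records, daily_key)    # pass 1
weekly  = aggregate(records, weekly_key)   # pass 2
monthly = aggregate(records, monthly_key)  # pass 3
```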

but this code does not seem efficient, because it does not exploit the data
dependencies in the reduce steps. For example, the data for 2010-01 has no
dependency on the data for any other month (2010-02, 2010-03, ...), so ideally
each node would load one month of data once and perform all three (daily,
weekly and monthly) aggregations on it.

I think I could use mapPartitions with a single large reducer that performs all
three aggregations, but I am not sure that is the right way to go.
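(To make the idea concrete, here is a hedged plain-Python sketch of the single-pass alternative; the names are hypothetical, and the per-partition loop stands in for what a mapPartitions-style reducer would do. Each record updates all three levels, so the data is read only once:)

```python
# Single pass: every record contributes to the daily, weekly and monthly
# totals at once, instead of being re-read for each granularity.
from collections import defaultdict
from datetime import datetime

LEVELS = {
    "daily":   lambda ts: ts.strftime("%Y-%m-%d"),
    "weekly":  lambda ts: ts.strftime("%Y-W%W"),
    "monthly": lambda ts: ts.strftime("%Y-%m"),
}

def aggregate_all(records):
    """One pass over (datetime, id, value) tuples; sums values at every level."""
    totals = {level: defaultdict(float) for level in LEVELS}
    for ts, id_, value in records:
        for level, key_fn in LEVELS.items():
            totals[level][(key_fn(ts), id_)] += value
    return {level: dict(t) for level, t in totals.items()}

records = [
    (datetime(2010, 1, 1, 10), "a", 1.0),
    (datetime(2010, 1, 8, 11), "a", 2.0),
]
result = aggregate_all(records)
```

Keeping the per-level key functions in a dict keeps the code modular: adding a new granularity is one extra entry, not a new pass over the data.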

Is there a more efficient way to perform these aggregations (loading the data
only once) while keeping the code modular?

