spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vasu Gourabathina <>
Subject Design aspects of Data partitioning for Window functions
Date Wed, 30 Aug 2017 19:33:48 GMT

If this question was already discussed, please let me know. I can try to
look into the archive.

Data Characteristics:
    entity_id  date  fact_1 fact_2  fact_N   derived_1  derived_2  derived_X

a) There are 1000s of such entities in the system
b) Each one has various Fact attributes per each day (to begin with). In
future, we wanted to support multiple entries per day
c) Goal is to calculate various Derived attributes...some of them are
Windows functions, such as Average, Moving Average etc
d) The total number of rows per each entity might not be equally

1) What's the best way to partition the data for better performance
optimization? Any things to consider given point #d above?

Sample code:
The following code seems to work fine on a smaller sample size:
      window =
Window.partitionBy('entity_id').orderBy('date').rowsBetween(-30, 0)
      moving_avg = mean(df['fact_1']).over(window)
      df2 = df.withColumn('derived_moving_avg', moving_avg)

Please advise if there are any aspects that need to be considered to make
it efficient to run on a larger data size (with N-node spark cluster).

Thanks in advance,

View raw message