I have a similar requirement: take top N by key. Right now I use groupByKey, but in some datasets one key would group more than half the data.
Yours respectfully, Xuefeng Wu 吴雪峰
I think it would depend on the type and amount of information you're collecting.
If you're just trying to collect small numbers for each window, and don't have an overwhelming number of windows, you might consider using accumulators. Just make one per value per time window, and for each data point, add it to the accumulators for the time windows it belongs to. We've found this approach a lot faster than anything involving a shuffle. This should work fine for stuff like max(), min(), and mean().
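To make the idea concrete, here's a minimal sketch of the per-window accumulation in plain Scala. In Spark you'd register one accumulator per window and add to it inside a foreach; the `Stats` class and the window-keyed map below are hypothetical stand-ins for that, showing how max, min, and mean fall out of a single pass with no shuffle.

```scala
// Running statistics for one time window - the role an accumulator
// would play in Spark. (Names here are illustrative, not Spark API.)
case class Stats(count: Long, sum: Double, min: Double, max: Double) {
  def add(v: Double): Stats =
    Stats(count + 1, sum + v, math.min(min, v), math.max(max, v))
  def mean: Double = if (count == 0) 0.0 else sum / count
}
object Stats { val empty = Stats(0, 0.0, Double.MaxValue, Double.MinValue) }

// One "accumulator" per time window, keyed by window id;
// a single fold over the data updates the matching window's stats.
def windowStats(points: Seq[(Int, Double)]): Map[Int, Stats] =
  points.foldLeft(Map.empty[Int, Stats].withDefaultValue(Stats.empty)) {
    case (acc, (window, value)) => acc.updated(window, acc(window).add(value))
  }
```

The key property is that each data point touches only small, fixed-size state per window, which is what makes the accumulator route cheap compared to a shuffle.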
If you're collecting enough data that accumulators are impractical, I think I would try multiple passes. Cache your data, and for each pass, filter to one window and perform all your operations on the filtered RDD. Because of the caching, it won't be significantly slower than processing it all at once - in fact, it will probably be a lot faster, because each shuffle is shuffling less information. This is similar to what you're suggesting about partitioning your RDD, but probably simpler.
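A rough sketch of that multi-pass shape, with a plain Scala collection standing in for a cached RDD (in Spark this would be `data.cache()` followed by one `data.filter(...)` pass per window; the `Point` type and sum aggregation are assumptions for illustration):

```scala
// A data point tagged with the time window it belongs to (hypothetical type).
case class Point(window: Int, value: Double)

// One pass per window: filter down to that window's slice, then run
// the aggregation on the slice only. With the source cached, each pass
// re-reads from memory rather than recomputing the input.
def perWindowTotals(data: Seq[Point], windows: Seq[Int]): Map[Int, Double] =
  windows.map { w =>
    val slice = data.filter(_.window == w)
    w -> slice.map(_.value).sum
  }.toMap
```

Each pass only shuffles one window's worth of data, which is where the speedup over a single all-windows aggregation comes from.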
That being said, your restriction 3 seems to contradict the rest of your request - if your aggregation needs to look at all the data at once, that's at odds with processing the data as a distributed RDD. Could you explain a bit more what you mean by that?
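On the earlier top-N-by-key question: the usual way to avoid groupByKey's skew is aggregateByKey with a bounded structure, so each partition keeps at most N values per key and the hot key never materializes all its values at once. Here's a sketch of the seqOp/combOp pair, exercised on plain collections; in Spark it would be roughly `rdd.aggregateByKey(List.empty[Int])(seqOp, combOp)` (the top-3 setup and Int values are assumptions for illustration).

```scala
val n = 3  // keep the N largest values per key

// seqOp: fold one new value into a partition-local top-N list.
def seqOp(acc: List[Int], v: Int): List[Int] =
  (v :: acc).sorted(Ordering[Int].reverse).take(n)

// combOp: merge two partial top-N lists from different partitions.
def combOp(a: List[Int], b: List[Int]): List[Int] =
  (a ++ b).sorted(Ordering[Int].reverse).take(n)

// Local stand-in for aggregateByKey, to show the ops' behavior.
def topNByKey(pairs: Seq[(String, Int)]): Map[String, List[Int]] =
  pairs.groupBy(_._1).map { case (k, vs) =>
    k -> vs.map(_._2).foldLeft(List.empty[Int])(seqOp)
  }
```

Because both ops cap the list at N, the per-key state stays tiny no matter how skewed the key is - only the bounded lists get shuffled, not the raw values.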