spark-user mailing list archives

From bgawalt
Subject Re: Efficient implementation of getting top 10 hashtags in last 5 mins window
Date Fri, 16 May 2014 18:06:26 GMT
Hi nilmish,

One option for you is to consider moving to a different algorithm. The
SpaceSaver/StreamSummary method will get you approximate results in exchange
for smaller data structure size. It has an implementation in Twitter's
Algebird library, if you're using Scala:

and has a more general write-up here:

I believe it will let you avoid an expensive sort of all the hundreds of
thousands of hashtags you can see in a day.
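
To make the idea concrete, here is a minimal sketch of the Space-Saving
counting scheme in plain Scala. It is not Algebird's implementation or API:
the class name SpaceSavingSketch and its methods offer and topK are made up
for illustration, and the linear scan for the minimum counter is a
simplification of the StreamSummary structure, which keeps counters ordered
so that eviction is cheap.

```scala
import scala.collection.mutable

// Illustrative Space-Saving counter that tracks at most `capacity` items.
// All names here are hypothetical; this is not Algebird's SpaceSaver API.
final class SpaceSavingSketch[T](capacity: Int) {
  require(capacity > 0, "capacity must be positive")

  // item -> (estimated count, upper bound on how much the count is overestimated)
  private val counters = mutable.Map.empty[T, (Long, Long)]

  def offer(item: T): Unit = counters.get(item) match {
    case Some((count, error)) =>
      // Already tracked: just bump its count.
      counters(item) = (count + 1, error)
    case None if counters.size < capacity =>
      // Room left: start tracking with an exact count.
      counters(item) = (1L, 0L)
    case None =>
      // Full: evict the item with the smallest count. The newcomer inherits
      // that count (plus one) and records it as its possible overestimate.
      // This scan is O(capacity); StreamSummary avoids it.
      val (victim, (minCount, _)) = counters.minBy(_._2._1)
      counters -= victim
      counters(item) = (minCount + 1, minCount)
  }

  // Sorts only the tracked items (at most `capacity`), never the full tag set.
  // Returns (item, estimated count, error bound), highest count first.
  def topK(k: Int): Seq[(T, Long, Long)] =
    counters.toSeq
      .map { case (item, (count, error)) => (item, count, error) }
      .sortBy { case (_, count, _) => -count }
      .take(k)
}

object SpaceSavingDemo {
  def main(args: Array[String]): Unit = {
    val sketch = new SpaceSavingSketch[String](capacity = 3)
    Seq("#spark", "#scala", "#spark", "#bigdata", "#spark", "#scala", "#ml")
      .foreach(sketch.offer)
    sketch.topK(2).foreach(println)
  }
}
```

The trade-off is the one described above: memory and sort cost are bounded by
the capacity you pick rather than by the number of distinct hashtags, and in
exchange the counts of items that entered through eviction can be
overestimated by up to their recorded error bound.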

