spark-user mailing list archives

From bgawalt <bgaw...@gmail.com>
Subject Re: Efficient implementation of getting top 10 hashtags in last 5 mins window
Date Fri, 16 May 2014 18:06:26 GMT
Hi nilmish,

One option is to switch to a different algorithm. The
SpaceSaver/StreamSummary method gives you approximate counts in exchange
for a much smaller, fixed-size data structure. It has an implementation in
Twitter's Algebird library, if you're using Scala:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/SpaceSaver.scala

and there is a more general write-up here:

http://boundary.com/blog/2013/05/14/approximate-heavy-hitters-the-spacesaving-algorithm/

I believe it will let you avoid an expensive sort over the hundreds of
thousands of distinct hashtags you might see in a day, since you only ever
keep (and sort) a small, fixed number of counters.
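
To make the idea concrete, here is a rough sketch of the Space-Saving update rule in plain Python. It is not Algebird's API and not tuned for Spark; the names space_saving, top_k and capacity are mine, chosen for illustration. It uses a linear scan to find the smallest counter, where the real StreamSummary structure does that step in constant time.

```python
def space_saving(stream, capacity):
    """Approximate heavy hitters using at most `capacity` counters.

    Returns (counts, errors): counts[item] never underestimates the true
    frequency, and errors[item] bounds how much it may overestimate it.
    """
    counts = {}  # item -> estimated count
    errors = {}  # item -> maximum possible overestimate
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < capacity:
            counts[item] = 1
            errors[item] = 0
        else:
            # Table is full: evict the item with the smallest count and let
            # the newcomer inherit that count (linear scan, for simplicity).
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[item] = floor + 1
            errors[item] = floor
    return counts, errors


def top_k(counts, k):
    """Sort only the tracked counters, not every distinct item seen."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical hashtag stream, for illustration only.
tags = ["#a"] * 5 + ["#b"] * 3 + ["#c", "#d"]
counts, errors = space_saving(tags, capacity=3)
print(top_k(counts, 2))
```

The final sort touches only `capacity` entries, however many distinct hashtags went through the stream. A larger capacity costs more memory and buys tighter error bounds.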

Best,
--Brian



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-implementation-of-getting-top-10-hashtags-in-last-5-mins-window-tp5741p5845.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
