samza-dev mailing list archives

From Uwe Dauernheim <...@dauernheim.net>
Subject Re: Modeling charts
Date Wed, 18 Feb 2015 09:32:27 GMT
Thanks Fang, I will read up on this.
/Uwe


On Tue, Feb 17, 2015 at 11:01 PM, Yan Fang <yanfang724@gmail.com> wrote:
> Hi Uwe,
>
> Your use case looks to me more like a state-management case. What comes
> to my mind is this:
> 1) Every time a song is played, you update the count for that song. You do
> not keep the map in memory since, as you said, it could become quite large.
> Instead, you use Samza's built-in key-value storage. (You do all this in
> the process method.)
>
> 2) You scan the whole key-value DB every, say, one hour. (You do all this
> in the window method.)
>
> * This could provide better fault tolerance. (For example, if your machine
> goes down during that hour, you will not lose any counts, because the
> key-value DB is restored.)
>
> Some relevant links:
> * http://samza.apache.org/learn/documentation/0.8/container/state-management.html#windowed-aggregation
> * http://samza.apache.org/learn/documentation/0.8/container/state-management.html#approaches-to-managing-task-state
> * http://samza.apache.org/learn/documentation/0.8/container/state-management.html#key-value-storage
>
> Hope this helps.
>
> Cheers,
>
> Fang, Yan
> yanfang724@gmail.com
> +1 (206) 849-4108
>
> On Tue, Feb 17, 2015 at 11:35 AM, Uwe Dauernheim <uwe@dauernheim.net> wrote:
>
>> I am trying to model a music charts system to get familiar with Samza.
>> Charts are defined as the top N entries, by highest count, of a map
>> from unique track ID (basically a song) to a counter (basically the
>> number of plays of that track) during a sliding time window.
>>
>> The problem I see is the ever-growing size of this map, as the ID
>> space of tracks can be quite large (let's pick 2E6). Not all of these
>> IDs will be played (and thus need to be counted) within a given time
>> window (let's pick 1 hour), but it's not obvious to me when to prune the
>> map during this sliding time window.
>>
>> I assume dealing with sliding time windows is a common case in stream
>> processing, so Samza probably provides some useful API for it. Does an
>> example or tutorial for this kind of sliding time-window counting exist?
>>
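
The process/window split suggested in the reply can be sketched roughly as
follows. This is plain Java, not Samza code: ChartsSketch, its store field and
the method signatures are hypothetical stand-ins, and an in-memory TreeMap takes
the place of Samza's key-value store. Clearing the store at the end of each
window is also an assumption. It gives a tumbling one-hour window rather than a
true sliding one, but it is one simple answer to the question of when to prune.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ChartsSketch {
    // Stand-in for a Samza key-value store (assumption: a real task would
    // read and write a store configured for the job, not an in-memory map).
    private final Map<String, Long> store = new TreeMap<>();

    // Plays the role of the task's process method: one call per play event.
    public void process(String trackId) {
        store.merge(trackId, 1L, Long::sum);
    }

    // Plays the role of the task's window method: scan all counts, pick the
    // top n, then reset the store so the next window starts empty.
    public List<Map.Entry<String, Long>> window(int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>();
        for (Map.Entry<String, Long> e : store.entrySet()) {
            // Copy each entry so the result survives clearing the store.
            entries.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        // Highest count first; the stable sort keeps ties in track-ID order.
        entries.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        store.clear();
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        ChartsSketch charts = new ChartsSketch();
        for (String id : new String[] {"a", "b", "a", "c", "a", "b"}) {
            charts.process(id);
        }
        // Print the top 2 tracks of this window.
        System.out.println(charts.window(2));
    }
}
```

Only tracks that were actually played in the current window ever get a key, so
the store size is bounded by the number of distinct tracks played per window,
not by the full ID space. A sliding window could be approximated by keeping
several smaller buckets (say, one per minute) and dropping the oldest bucket on
each window call, at the cost of more keys in the store.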
