mahout-user mailing list archives

From Johannes Schulte <>
Subject Re: "LLR with time"
Date Sat, 11 Nov 2017 08:43:19 GMT
Pat, thanks for your help, especially the insights on how you handle the
system in production and the tips for multiple acyclic buckets.
Combining the signals at query time sounds okay, but as you say,
it's always hard to find the right boosts without setting up some LTR
(learning-to-rank) system. If there were a way to use the hotness when
calculating the indicators for subpopulations, it would be great, especially for a cross

e.g. people in Greece _now_ are viewing this show/product/whatever

And here the popularity of the recommended item in this subpopulation could
be overlooked when just looking at the overall derivatives of activity.

Maybe one could do multiple G-Tests using sliding windows
 * itemA&itemB  vs population (classic)
 * itemA&itemB(t) vs itemA&itemB(t-1)

and derive multiple indicators per item to be indexed.
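
To make the two tests concrete, here is a minimal sketch in Python. The llr function is the usual 2x2 G-test written with unnormalized entropies. The layout of the second table (pair co-occurrences vs. all other events, current window vs. previous window) is only my assumption of how such a test could be set up, and temporal_llr is a hypothetical helper name. Since the G-test itself has no direction, the helper flips the sign when the pair's rate went down.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """G-test (log-likelihood ratio) score for a 2x2 contingency table.

    Classic co-occurrence use: k11 = A and B, k12 = A without B,
    k21 = B without A, k22 = neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding
    return max(0.0, 2.0 * (row + col - mat))


def temporal_llr(pair_now, total_now, pair_prev, total_prev):
    """Assumed table layout for the window-vs-previous-window test.

    Rows are the windows t and t-1; columns are "A&B co-occurrences"
    and "all other events". Positive means the pair's rate went up,
    negative means it went down.
    """
    score = llr(pair_now, total_now - pair_now,
                pair_prev, total_prev - pair_prev)
    rising = pair_now * total_prev > pair_prev * total_now
    return score if rising else -score


if __name__ == "__main__":
    # Made-up counts: the pair jumps from 10 to 50 co-occurrences
    # while both windows contain 1000 events in total.
    print(temporal_llr(50, 1000, 10, 1000))
```

Each item pair would then carry two scores, one against the population and one against its own recent past, and either could be thresholded into an indicator.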

But this all relies on discretizing time into buckets and not looking at
the distribution of time between events as in the presentation above. Maybe
there is something way smarter.


On Sat, Nov 11, 2017 at 2:50 AM, Pat Ferrel <> wrote:

> BTW you should take time buckets that are relatively free of daily cycles
> like 3 day, week, or month buckets for “hot”. This is to remove cyclical
> effects from the frequencies as much as possible, since you need 3 buckets
> to see the change in change, 2 for the change, and 1 for the event volume.
>
> On Nov 10, 2017, at 4:12 PM, Pat Ferrel <> wrote:
> So your idea is to find anomalies in event frequencies to detect “hot”
> items?
> Interesting, maybe Ted will chime in.
> What I do is take the frequency and its first and second derivatives as
> measures of popularity, increasing popularity, and increasingly increasing
> popularity. Put another way: popular, trending, and hot. This is simple to
> do by taking 1, 2, or 3 time buckets and looking at the number of events,
> the derivative (difference), and the second derivative. Ranking all items by
> these values gives various measures of popularity or its increase.
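
A small sketch of how I read the three measures, assuming each item has event counts for its three most recent, equally sized buckets (oldest first). The function names and the sample counts are made up for illustration and are not taken from any existing code.

```python
def popularity_measures(buckets):
    """Return popular / trending / hot for one item.

    buckets: event counts per time bucket, oldest first; only the
    last three are used.
    """
    older, previous, latest = buckets[-3:]
    velocity = latest - previous                # first difference
    acceleration = velocity - (previous - older)  # second difference
    return {"popular": latest, "trending": velocity, "hot": acceleration}


def rank_items(counts_by_item, measure):
    # Sort item ids by the chosen measure, highest first
    scores = {item: popularity_measures(buckets)[measure]
              for item, buckets in counts_by_item.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical weekly counts for three items
    counts = {
        "a": [100, 110, 120],  # big, growing steadily
        "b": [5, 10, 40],      # small, but taking off
        "c": [50, 40, 35],     # shrinking, though more slowly
    }
    for measure in ("popular", "trending", "hot"):
        print(measure, rank_items(counts, measure))
```

With these counts the steadily growing item "a" leads on "popular" but comes last on "hot", because its growth is not accelerating.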
> If your use is in a recommender you can add a ranking field to all items
> and query for “hot” by using the ranking you calculated.
> If you want to bias recommendations by hotness, query with user history
> and boost by your hot field. I suspect the hot field will tend to overwhelm
> your user history in this case, as it would if you used anomalies, so you’d
> also have to normalize the hotness to some range closer to the one created
> by the user history matching score. I haven’t found a very good way to mix
> these in a model, so use hot as a method of backfill if you cannot return
> enough recommendations or in places where you may want to show just hot
> items. There are several benefits to this method of using hot to rank all
> items including the fact that you can apply business rules to them just as
> normal recommendations—so you can ask for hot in “electronics” if you know
> categories, or hot "in-stock" items, or ...
> Still anomaly detection does sound like an interesting approach.
>
> On Nov 10, 2017, at 3:13 PM, Johannes Schulte <> wrote:
> Hi "all",
> I am wondering what would be the best way to incorporate event time
> information into the calculation of the G-Test.
> There is a claim here
> saying "Time aware variant of G-Test is possible"
> I remember I experimented with exponentially decayed counts some years ago
> and this involved changing the counts to doubles, but I suspect there is
> some smarter way. What I don't get is the relation to a data structure like
> T-Digest when working with a lot of counts / cells for every combination of
> items. Keeping a t-digest for every combination seems unfeasible.
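
For reference, this is roughly what I mean by exponentially decayed counts: a minimal sketch, not the code I used back then. DecayedCounter and its half_life parameter are illustrative names. The counter stores one double plus the time of the last update, so every cell of a contingency table becomes a fractional count.

```python
class DecayedCounter:
    """Event counter whose contributions halve every `half_life` time units."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0       # decayed count as of self.last_time
        self.last_time = None  # time of the most recent event seen

    def _factor(self, elapsed):
        # Decay multiplier for a non-negative elapsed time
        return 0.5 ** (elapsed / self.half_life)

    def add(self, t, weight=1.0):
        if self.last_time is None or t >= self.last_time:
            # Decay the stored value forward to t, then add the new event
            self.value = self.read(t) + weight
            self.last_time = t
        else:
            # Late (out-of-order) event: discount it to the reference time
            self.value += weight * self._factor(self.last_time - t)

    def read(self, now):
        # Decayed count at time `now`; assumes now >= time of the last event
        if self.last_time is None:
            return 0.0
        return self.value * self._factor(now - self.last_time)


if __name__ == "__main__":
    c = DecayedCounter(half_life=10.0)
    c.add(0.0)
    c.add(10.0)
    print(c.read(10.0), c.read(20.0))
```

The memory cost is two numbers per cell, which is what makes this workable for every item combination where a full sketch per cell would not be.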
> How would one incorporate event time into recommendations to detect
> "hotness" of certain relations? Glad if someone has an idea...
> Cheers,
> Johannes
