mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: "LLR with time"
Date Sat, 11 Nov 2017 17:31:50 GMT
If Mahout were to use http://bit.ly/poisson-llr it would tend to favor new events when calculating
the LLR score that is later used in the threshold for whether a co-occurrence or cross-occurrence
is incorporated into the model. This is very interesting and would be useful in cases where you can
keep a lot of data or where recent data is far more important, like news. This is the time-aware
G-test you are referencing, as I understand it.
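
For concreteness, a minimal sketch of what LLR over time-weighted counts could look like
(Python; the exponential half-life is an illustrative choice, not the Poisson formulation
from the linked slides):

    import math

    def xlogx(x):
        return 0.0 if x == 0 else x * math.log(x)

    def entropy(*counts):
        # Unnormalized entropy term used by the G-test.
        return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

    def llr(k11, k12, k21, k22):
        # Standard G-test / log-likelihood ratio for a 2x2 contingency table.
        return 2.0 * (entropy(k11 + k12, k21 + k22)
                      + entropy(k11 + k21, k12 + k22)
                      - entropy(k11, k12, k21, k22))

    def decayed_count(event_times, now, half_life=30.0):
        # Each event contributes 2^(-age/half_life), so recent events count more.
        return sum(2.0 ** (-(now - t) / half_life) for t in event_times)

    # Hypothetical usage: build the 2x2 cells from decayed sums instead of raw
    # integer counts, then threshold on llr(k11, k12, k21, k22) as usual.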

But it doesn’t relate to popularity as I think Ted is saying.

Are you looking for 1) personal recommendations biased by hotness in Greece or 2) things hot
in Greece?

1) Create a secondary indicator for “watched in some locale”. The locale-id might use a
country-code+postal-code, but not lat-lon; something that covers a good number of
people/events. Then the query would be user-id and user-locale (see the query sketch
after this list). This would yield personal recs preferred in the user’s locale,
Athens-west-side in this case.
2) Split the data into locales and do the “hot” calc I mention. The query would have no
user-id since it is not personalized, but it would yield “hot in Greece”.
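
To make option 1 concrete, a rough sketch of such a query against a search-engine index
(Elasticsearch-style syntax as a Python dict; the field names "purchase" and
"watched_in_locale" and the locale id format are hypothetical, not an actual schema):

    def build_locale_biased_query(user_history_items, user_locale, locale_boost=0.5):
        # Personal recs from the user's history, nudged toward the user's locale.
        return {
            "query": {
                "bool": {
                    "should": [
                        # primary indicator: items co-occurring with the user's history
                        {"terms": {"purchase": user_history_items}},
                        # secondary "watched in some locale" indicator, down-weighted
                        {"terms": {"watched_in_locale": [user_locale],
                                   "boost": locale_boost}},
                    ]
                }
            }
        }

    # e.g. a user on the west side of Athens (hypothetical locale id):
    q = build_locale_biased_query(["item-12", "item-98"], "GR-118")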

Ted’s “Christmas video” tag is what I was calling a business rule and can be added to
either of the above techniques.

On Nov 11, 2017, at 4:01 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

So ... there are a few different threads here.

1) LLR but with time. Quite possible, but not really what Johannes is
talking about, I think. See http://bit.ly/poisson-llr for a quick
discussion.

2) Time-varying recommendation. As Johannes notes, this can make use of
windowed counts. The problem is that rarely accessed items should probably
have longer windows so that we use longer-term trends when we have less
data.

The good news here is that some part of this is nearly already in the
code. The trick is that the down-sampling used in the system can be adapted
to favor recent events over older ones. That means that if the meaning of
something changes over time, the system will catch on. Likewise, if
something appears out of nowhere, it will quickly train up. This handles
the "popular in Greece right now" problem.
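
One way to picture recency-favoring down-sampling (an illustrative sketch only, not the
actual sampler in Mahout's code; the per-user cap is made up):

    import heapq
    from collections import defaultdict

    def downsample_favoring_recent(events, cap=500):
        # events: iterable of (user_id, item_id, timestamp).
        # Keep at most `cap` events per user, preferring the most recent ones,
        # so co-occurrence counts reflect current rather than historical behavior.
        kept = defaultdict(list)          # user -> min-heap of (timestamp, item)
        for user, item, ts in events:
            heap = kept[user]
            if len(heap) < cap:
                heapq.heappush(heap, (ts, item))
            elif ts > heap[0][0]:
                heapq.heapreplace(heap, (ts, item))   # evict the oldest kept event
        return {u: sorted(h) for u, h in kept.items()}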

But this isn't the whole story of changing recommendations. Another problem
that we commonly face is what I call the Christmas music issue. The idea is
that there are lots of recommendations for music that is highly seasonal.
Thus, Bing Crosby fans want to hear White Christmas
<https://www.youtube.com/watch?v=P8Ozdqzjigg> until the day after Christmas,
at which point it becomes a really bad recommendation. This can be partially
dealt with by using temporal tags as indicators, but that doesn't really
allow a recommendation to be completely shut down.

The only way that I have seen to deal with this in the past is with a
manually designed kill switch. As much as possible, we would tag the
obviously seasonal content and then add a filter to kill or downgrade that
content the moment it went out of fashion.
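
In practice such a kill switch is usually just a query-time filter; a minimal sketch,
assuming a hand-maintained table of seasonal tags and dates (all hypothetical):

    from datetime import date

    # Hand-maintained seasonal windows; outside the window the tag kills the item.
    SEASONAL_WINDOWS = {
        "christmas": (date(2017, 11, 15), date(2017, 12, 26)),
    }

    def allowed(item_tags, today):
        # Drop items whose seasonal tag is outside its window.
        for tag in item_tags:
            window = SEASONAL_WINDOWS.get(tag)
            if window and not (window[0] <= today <= window[1]):
                return False
        return True

    recs = [("white-christmas-video", {"christmas"}), ("other-video", set())]
    filtered = [i for i, tags in recs if allowed(tags, date(2017, 12, 27))]
    # -> only "other-video" survives after Dec 26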



On Sat, Nov 11, 2017 at 9:43 AM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> Pat, thanks for your help, especially the insights on how you handle the
> system in production and the tips for multiple acyclic buckets.
> Combining the signals when querying sounds okay but, as you say,
> it's always hard to find the right boosts without setting up some
> learning-to-rank system. If there were a way to use the hotness when
> calculating the indicators for subpopulations it would be great,
> especially for a cross-recommender.
> 
> e.g. people in Greece _now_ are viewing this show/product/whatever
> 
> And here the popularity of the recommended item in this subpopulation could
> be overlooked when just looking at the overall derivatives of activity.
> 
> Maybe one could do multiple G-Tests using sliding windows
> * itemA&itemB  vs population (classic)
> * itemA&itemB(t) vs itemA&itemB(t-1)
> ..
> 
> and derive multiple indicators per item to be indexed.
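
For concreteness, the second comparison above could be scored with the same LLR/G-test
statistic applied to a window-vs-window table (the counts below are made up):

    import math

    def xlogx(x):
        return 0.0 if x == 0 else x * math.log(x)

    def entropy(*counts):
        return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

    def llr(k11, k12, k21, k22):
        return 2.0 * (entropy(k11 + k12, k21 + k22)
                      + entropy(k11 + k21, k12 + k22)
                      - entropy(k11, k12, k21, k22))

    # itemA&itemB(t) vs itemA&itemB(t-1): is the co-occurrence rate in the
    # current window surprisingly different from the previous window?
    k11 = 40    # A&B co-occurrences in window t      (illustrative numbers)
    k12 = 960   # all other events in window t
    k21 = 10    # A&B co-occurrences in window t-1
    k22 = 990   # all other events in window t-1
    hotness_indicator = llr(k11, k12, k21, k22)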
> 
> But this all relies on discretizing time into buckets and not looking at
> the distribution of time between events like in the presentation above; maybe
> there is something way smarter.
> 
> Johannes
> 
> On Sat, Nov 11, 2017 at 2:50 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
> 
>> BTW you should take time buckets that are relatively free of daily cycles,
>> like 3-day, week, or month buckets, for “hot”. This is to remove cyclical
>> effects from the frequencies as much as possible, since you need 3 buckets
>> to see the change in change, 2 for the change, and 1 for the event
>> volume.
>> 
>> 
>> On Nov 10, 2017, at 4:12 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>> 
>> So your idea is to find anomalies in event frequencies to detect “hot”
>> items?
>> 
>> Interesting, maybe Ted will chime in.
>> 
>> What I do is take the frequency and its first and second derivatives as
>> measures of popularity, increasing popularity, and increasingly increasing
>> popularity. Put another way: popular, trending, and hot. This is simple to
>> do by taking 1, 2, or 3 time buckets and looking at the number of events,
>> the derivative (difference), and the second derivative. Ranking all items by
>> these values gives various measures of popularity or its increase.
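
A rough sketch of that bucket arithmetic (the bucket counts are illustrative only):

    def popularity_measures(bucket_counts):
        # bucket_counts: events per item in the three most recent buckets,
        # oldest first, e.g. [c(t-2), c(t-1), c(t)].
        c0, c1, c2 = bucket_counts
        popular = c2                    # raw event volume in the latest bucket
        trending = c2 - c1              # first difference (derivative)
        hot = (c2 - c1) - (c1 - c0)     # second difference
        return popular, trending, hot

    items = {"item-a": [10, 20, 45], "item-b": [50, 52, 51]}
    ranked_by_hot = sorted(items, key=lambda i: popularity_measures(items[i])[2],
                           reverse=True)
    # -> "item-a" ranks above "item-b" on the "hot" measure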
>> 
>> If your use is in a recommender you can add a ranking field to all items
>> and query for “hot” by using the ranking you calculated.
>> 
>> If you want to bias recommendations by hotness, query with user history
>> and boost by your hot field. I suspect the hot field will tend to overwhelm
>> your user history in this case, as it would if you used anomalies, so you’d
>> also have to normalize the hotness to some range closer to the one created
>> by the user history matching score. I haven’t found a very good way to mix
>> these in a model, so use hot as a method of backfill if you cannot return
>> enough recommendations, or in places where you may want to show just hot
>> items. There are several benefits to this method of using hot to rank all
>> items, including the fact that you can apply business rules to them just as
>> normal recommendations, so you can ask for hot in “electronics” if you know
>> categories, or hot "in-stock" items, or ...
>> 
>> Still anomaly detection does sound like an interesting approach.
>> 
>> 
>> On Nov 10, 2017, at 3:13 PM, Johannes Schulte <
> johannes.schulte@gmail.com>
>> wrote:
>> 
>> Hi "all",
>> 
>> I am wondering what would be the best way to incorporate event time
>> information into the calculation of the G-Test.
>> 
>> There is a claim here
>> https://de.slideshare.net/tdunning/finding-changes-in-real-data
>> 
>> saying "Time aware variant of G-Test is possible"
>> 
>> I remember I experimented with exponentially decayed counts some years ago
>> and this involved changing the counts to doubles, but I suspect there is
>> some smarter way. What I don't get is the relation to a data structure like
>> T-Digest when working with a lot of counts / cells for every combination of
>> items. Keeping a t-digest for every combination seems infeasible.
>> 
>> How would one incorporate event time into recommendations to detect
>> "hotness" of certain relations? Glad if someone has an idea...
>> 
>> Cheers,
>> 
>> Johannes
>> 
>> 
>> 
> 

