mahout-user mailing list archives

From Pat Ferrel <>
Subject using root LLR
Date Tue, 15 Nov 2016 20:21:46 GMT
I understand the eyeball method, but I am not sure users will, so I am working on a t-digest calculation
of an LLR threshold. The goal is to maintain a certain sparsity at maximum “quality”. But
I have a few questions.

You mention root LLR. OK, but that will produce negative numbers. I assume:

1) We should use the absolute value of root LLR for ranking in the max # of indicators sense.
There seems to be no value in computing sqrt( | rootLLR | ), since the rank will not change, but
we can’t just use the value returned by the Java root LLR function directly.
2) Likewise we use the absolute value of root LLR to compare with the threshold. Put another
way without using absolute value the value passes the LLR threshold test if mean - threshold
< value < mean + threshold
3) However the positive and negative root LLR values would be used in the t-digest quantile
calc, which ideally would have mean = 0.

Seems simple but just checking my understanding, are these correct?
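
To make points 1 and 2 concrete, here is a minimal Python sketch of ranking and thresholding by the absolute value of signed root LLR scores. The scores, the `keep_fraction` parameter, and both function names are hypothetical. An exact sorted-list quantile stands in for the t-digest estimate, and the threshold is taken from the distribution of absolute values, which is only one reading of the sparsity goal; whether signed values should feed the digest instead is the question in point 3.

```python
import math

def exact_quantile(values, q):
    """Linearly interpolated quantile; a stand-in for a t-digest estimate."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def select_indicators(signed_root_llr, keep_fraction, max_indicators):
    """Keep indicators whose |root LLR| reaches the threshold, strongest first."""
    magnitudes = [abs(s) for s in signed_root_llr.values()]
    # Threshold chosen so that roughly keep_fraction of the scores survive.
    threshold = exact_quantile(magnitudes, 1.0 - keep_fraction)
    ranked = sorted(signed_root_llr.items(),
                    key=lambda kv: abs(kv[1]), reverse=True)
    kept = [(item, s) for item, s in ranked if abs(s) >= threshold]
    # The max # of indicators cap is applied after the threshold test.
    return threshold, kept[:max_indicators]

# Hypothetical signed root LLR scores for one item.
scores = {"a": 6.2, "b": -7.5, "c": 0.4, "d": -0.9, "e": 3.1}
threshold, kept = select_indicators(scores, keep_fraction=0.4, max_indicators=2)
```

Note that ranking by magnitude keeps strongly negative scores such as "b", so the sign would have to be carried along if anti-correlation needs to be treated differently downstream.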

On Jan 2, 2016, at 3:17 PM, Ted Dunning <> wrote:

I usually like to use a combination of a fixed threshold for llr plus a max number of indicators.

The fixed threshold I use is typically around 20-30 for raw LLR which corresponds to about
5 for root LLR. I often eyeball the lists of indicators for items that I understand to find
a point where the list of indicators becomes about half noise, half useful indicators.
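
The relationship between raw and root LLR can be sketched in Python as follows. This is modeled on the entropy formulation used by Mahout's LogLikelihood class but is written from memory as an illustration, not copied from the Mahout source; the function names and the `passes_threshold` helper are this sketch's own. The counts k11, k12, k21, k22 are the cells of the 2x2 cooccurrence contingency table.

```python
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy: xlogx(total) - sum of xlogx(count)."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Raw log-likelihood ratio score for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def root_llr(k11, k12, k21, k22):
    """Signed square root of LLR: negative when k11 is under-represented."""
    root = math.sqrt(llr(k11, k12, k21, k22))
    # Cross-multiplied form of k11/(k11+k12) < k21/(k21+k22), avoiding
    # division by zero for empty rows.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        root = -root
    return root

def passes_threshold(k11, k12, k21, k22, root_threshold=5.0):
    """A root LLR cutoff of 5 equals a raw LLR cutoff of 25."""
    return root_llr(k11, k12, k21, k22) >= root_threshold
```

Because the root is signed, `passes_threshold` as written only admits positively correlated pairs; comparing `abs(root_llr(...))` instead would also admit strong anti-correlations.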

On Sat, Jan 2, 2016 at 2:15 PM, Pat Ferrel <> wrote:
One interesting thing we saw is that like-genre was better discarded and dislike-genre left
in the mix.

This brings up a fundamental issue with how we use LLR to downsample in Mahout. In this case
by downsampling I mean llr(A’B), where we keep some max number of indicators based on the
best LLR score. For the primary action—something like “buy”—this works well since
there are usually quite a lot of items, but for B there may be very few items, genre is an
example. Using the same max # of indicators for A’A as well as all the rest (A’B, etc)
means that very little if any downsampling based on LLR score happens for A’B. So for A’B
the result is really more like simple cross-cooccurrence.
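
A small Python sketch of the effect described above, using made-up genre scores and a hypothetical `top_k_per_row` helper (this is not Mahout's implementation): when the per-row cap is larger than the number of columns in B, the cap never removes anything.

```python
def top_k_per_row(scores_by_row, max_indicators):
    """Keep at most max_indicators of the highest-scoring columns per row."""
    kept = {}
    for row, scores in scores_by_row.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        kept[row] = ranked[:max_indicators]
    return kept

# Hypothetical LLR scores of one video against five genres.
genre_scores = {
    "video-1": {"drama": 31.0, "comedy": 2.1, "horror": 0.3,
                "action": 0.2, "romance": 0.1},
}

# With a cap sized for A'A, all five genres survive, noise included.
kept_default = top_k_per_row(genre_scores, max_indicators=100)

# A cap scaled to the size of B actually discards the weak scores.
kept_small = top_k_per_row(genre_scores, max_indicators=2)
```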

This seems worth addressing, if only because in our analysis the effect made like-genre useless,
when intuition would say that it should be useful. Our hypothesis is that since no downsampling
happened and very many of the reviewers preferred most all of the genres it had no differentiating
value. If we had changed the per item max indicators to some smaller number this might have
left only strongly correlated like-genre indicators.

Assuming I’ve got the issue correctly identified the options I can think of are:
1) use a fixed number LLR threshold for A’B or other cross-cooccurrence indicator. This
seems pretty impractical. 
2) add a max indicators threshold param for each of the secondary indicators. This would be
fairly easy and could be based on the # of B items. Some method of choosing this might end
up being ~100 for A’A (the default), and a function of the # of items in B, C, etc. The
plus is that this would be easy and keep the calculation at O(n), but the function that returns
100 for A, and some smaller number for B, C, and the rest, is not clear to me.
3) create a threshold based on the distribution of llr(A’B). This could be based on a correlation
confidence (actually confidence of non-correlation for LLR). The down side is that this means
we need to calculate all of llr(A’B) which approaches O(n^2) then do the downsampling of
the complete llr(A’B). This removes the rather significant practical benefit of the current
downsampling algorithm. Practically speaking most indicators will be of dimensionality on
the order of # of A items or will be very very much smaller, like # of genres. So maybe
calculating the distribution of llr(A’B) wouldn’t be too bad if only done when B has a
small number of items. In the small B case it would be O(n*m) where m is the number of items
in B and n is the number of items in A and m << n so this would nearly be O(n). Also
this could be mixed with #2 and only calculated every so often since it probably won’t change
very much in any one application.

I guess I’d be inclined to test by trying a range of max # of indicators on our test data
since the number of genres is small. If there is any place that produces significantly
better results we could proceed to try the confidence method and see if it allows us to calculate
the optimal #. If so then we could implement this for very occasional calculation on live

Any advice?

> On Dec 30, 2015, at 2:26 PM, Ted Dunning <> wrote:
> This is really nice work!
> On Wed, Dec 30, 2015 at 11:50 AM, Pat Ferrel <> wrote:
> As many of you know Mahout-Samsara includes an interesting and important extension to
cooccurrence similarity, which supports cross-cooccurrence and log-likelihood downsampling.
This, when combined with a search engine, gives us a multimodal recommender. Some of us integrated
Mahout with a DB and search engine to create what we call (humbly) the Universal Recommender.

> We just completed a tool that measures the effects of what we call secondary events or
indicators using the Universal Recommender. It calculates a ranking based precision metric
called mean average precision—MAP@k. We took a dataset from the Rotten Tomatoes web site
of “fresh”, and “rotten” reviews and combined that with data about the genres, casts,
directors, and writers of the various video items. This gave us the indicators below:
> like, video-id <== primary indicator
> dislike, video-id
> like-genre, genre-id
> dislike-genre, genre-id
> like-director, director-id
> dislike-director, director-id
> like-writer, writer-id
> dislike-writer, writer-id
> like-cast, cast-member-id
> dislike-cast, cast-member-id
> These aren’t necessarily what we would have chosen if we were designing something from
scratch but are possible to gather from public data.
> We have only ~5000 mostly professional reviewers with ~250k video items in this dataset
but have a larger one we are integrating. We are also writing a white paper and blog post
with some deeper analysis. There are several tidbits of insight when you look deeper.
> The bottom line is that using most of the above indicators we were able to get a 26%
increase in MAP@1 over using only “like”. This is important because the vast majority
of recommenders can only really ingest one type of indicator.
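
For readers unfamiliar with the MAP@k metric mentioned above, a minimal Python sketch follows. It is not the measurement tool described in the message. Definitions of average precision at k differ in how they normalize; this version divides by min(k, number of relevant items), which is an assumption, and it also assumes each recommendation list contains no duplicates. The user and item ids are made up.

```python
def average_precision_at_k(recommended, relevant, k):
    """Average of precision@rank over the ranks (up to k) that hit."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(k, len(relevant))

def mean_average_precision_at_k(recs_by_user, relevant_by_user, k):
    """Mean of AP@k over all users that have a recommendation list."""
    if not recs_by_user:
        return 0.0
    scores = [
        average_precision_at_k(recs, relevant_by_user.get(user, set()), k)
        for user, recs in recs_by_user.items()
    ]
    return sum(scores) / len(scores)

# Hypothetical held-out "like" events and ranked recommendations.
recs = {"u1": ["a", "b", "c"], "u2": ["x", "y"]}
liked = {"u1": {"a", "c"}, "u2": {"y"}}
map_at_1 = mean_average_precision_at_k(recs, liked, k=1)
map_at_3 = mean_average_precision_at_k(recs, liked, k=3)
```

With k = 1 this reduces to the fraction of users whose top recommendation is a hit.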