mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: root LLR support in org.apache.mahout.math.cf.SimilarityAnalysis
Date Tue, 15 Dec 2015 17:01:06 GMT
No, but if you want to work on that, feel free; it should be pretty easy to add that option. However,
be aware that LLR is used in the downsampling step, so you don’t get all elements of llr(A’A).
To keep the calculation at O(n), downsampling is based on the number of non-zero
elements in a row of both A and A’A, keeping the highest LLR scoring elements. These are
params that you can control in the current implementation.
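
For reference, here is a minimal Python sketch of LLR and signed root LLR for a 2x2 cooccurrence table, plus a keep-the-top-k-per-row downsampler. It follows the usual entropy formulation of the G-test and is not a copy of the Mahout code; the function names and the dict-of-dicts layout for the scores are assumptions made for illustration.

```python
import math

def x_log_x(x):
    # x * ln(x), with the 0 * ln(0) = 0 convention
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # k11: both events together, k12/k21: one without the other, k22: neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to guard against small negative values from rounding
    return max(0.0, 2.0 * (row + col - mat))

def root_llr(k11, k12, k21, k22):
    # Signed square root: negative when the pair cooccurs less than expected
    r = math.sqrt(llr(k11, k12, k21, k22))
    if k11 + k12 > 0 and k21 + k22 > 0:
        if k11 / (k11 + k12) < k21 / (k21 + k22):
            r = -r
    return r

def top_k_per_row(scores, k):
    # scores: {row_id: {col_id: score}} (hypothetical layout);
    # keep only the k highest-scoring entries in each row
    return {
        row: dict(sorted(cols.items(), key=lambda kv: kv[1], reverse=True)[:k])
        for row, cols in scores.items()
    }
```

Plain LLR is never negative, so it cannot tell over-representation from under-representation; the sign on root LLR is what adds that information.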

For some types of analysis, where you would like A’A downsampled based on a purely probabilistic
metric like confidence in non-correlation, it might be nice to have a threshold-based downsampler
where the threshold is some fraction of all elements or some confidence value, rather than
a fixed value of LLR (a fixed value is trivial to add but not very useful). This requires that we find
a way to calculate the distribution parameters of LLR in A’A so a confidence threshold can
be derived. I haven’t put a lot of thought into this but iirc LLR is asymptotically Chi-square
distributed (1 degree of freedom for a 2x2 cooccurrence table; going from old brain cells here)
and signed root LLR is approximately normally distributed. If there
is some clever way to find the threshold without calculating all of rllr(A’A), which would
be O(n^2), then the confidence threshold downsampling could be kept O(n) and this would be
a very useful contribution.
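
As a rough sketch of the confidence-threshold idea, the snippet below turns a confidence value into a cutoff. It assumes the textbook asymptotics for a 2x2 table under independence: signed root LLR is approximately standard normal, and LLR is approximately Chi-square with 1 degree of freedom. The function names and the pair-keyed dict of scores are hypothetical. It only shows how to derive the threshold; it does not address the open problem of avoiding the O(n^2) computation of rllr(A’A).

```python
from statistics import NormalDist

def root_llr_threshold(confidence):
    # One-sided cutoff on signed root LLR, assuming it is ~ N(0, 1)
    # when the two items are independent
    return NormalDist().inv_cdf(confidence)

def llr_threshold(confidence):
    # Two-sided cutoff on LLR, assuming it is ~ Chi-square with 1 degree
    # of freedom: that quantile is the squared normal quantile at 1 - alpha/2
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z

def keep_confident(scores, confidence):
    # scores: {(item_a, item_b): llr_score} (hypothetical layout);
    # drop pairs whose LLR does not clear the confidence cutoff
    cutoff = llr_threshold(confidence)
    return {pair: s for pair, s in scores.items() if s >= cutoff}
```

With these assumptions a 95% confidence level gives an LLR cutoff of about 3.84, so a call like `keep_confident(scores, 0.95)` keeps only pairs scoring above that.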


On Dec 14, 2015, at 8:04 PM, Nikaash Puri <nikaashpuri@gmail.com> wrote:

Hi,

Just wondering whether there is support for using root Log Likelihood Ratio
via some sort of flag in the cooccurrencesIDSs function
in org.apache.mahout.math.cf.SimilarityAnalysis. Otherwise, I can create an
issue and work on adding said support.

Thank you,
Nikaash Puri

