No. If you want to work on that, feel free; it should be pretty easy to add that option. However,
be aware that LLR is used in the downsampling step, so you don’t get all elements of llr(A’A).
To keep the calculation at O(n), downsampling is based on the number of nonzero
elements in a row of both A and A’A, keeping the highest LLR-scoring elements. These are
params that you can control in the current implementation.
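
To make the per-row downsampling concrete, here is a rough Python sketch (not the Mahout code). It assumes each row is already available as a map from column id to LLR score; `keep_top_llr` and `max_per_row` are hypothetical names used only for illustration.

```python
import heapq


def keep_top_llr(rows, max_per_row):
    """Keep only the highest-scoring elements of each row.

    rows maps a row id to {column id: LLR score}. The result has the
    same shape, with at most max_per_row entries per row.
    """
    trimmed = {}
    for row_id, scores in rows.items():
        # nlargest avoids fully sorting the row when only a few entries survive.
        top = heapq.nlargest(max_per_row, scores.items(), key=lambda kv: kv[1])
        trimmed[row_id] = dict(top)
    return trimmed
```

Because the cap is per row, the number of retained elements grows linearly with the number of rows rather than quadratically.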
For some types of analysis, where you would like A’A downsampled based on a purely probabilistic
metric like confidence in noncorrelation, it might be nice to have a threshold-based downsampler
where the threshold is some fraction of all elements or some confidence value, rather than
a fixed value of LLR (which is trivial to add but not very useful). The confidence version requires that we find
a way to calculate the distribution parameters of LLR in A’A so a confidence threshold can
be derived. I haven’t put a lot of thought into this, but iirc LLR for a 2x2 table is asymptotically
chi-square with 1 degree of freedom (going from old brain cells here) and signed root LLR is
approximately normally distributed. If there
is some clever way to find the threshold without calculating all of rllr(A’A), which would
be O(n^2), then the confidence threshold downsampling could be kept O(n), and this would be
a very useful contribution.
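
As a starting point, here is a rough Python sketch of how a confidence value could be turned into a threshold. It is an illustration, not the Mahout implementation: it assumes the usual entropy-based LLR for a 2x2 contingency table (counts k11, k12, k21, k22) and assumes signed root LLR is approximately standard normal under independence. All function names are hypothetical, and it does nothing to solve the O(n^2) problem mentioned above.

```python
import math
from statistics import NormalDist


def x_log_x(x):
    # By convention 0 * log(0) is treated as 0.
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Signed square root of LLR; negative when the pair co-occurs less than expected."""
    r = math.sqrt(llr(k11, k12, k21, k22))
    n1, n2 = k11 + k12, k21 + k22
    if n1 > 0 and n2 > 0 and k11 / n1 < k21 / n2:
        r = -r
    return r


def root_llr_threshold(confidence):
    """Two-sided cutoff on |root LLR| for a confidence in (0, 1), under the normal assumption."""
    return NormalDist().inv_cdf((1.0 + confidence) / 2.0)


def llr_threshold(confidence):
    """Equivalent cutoff on LLR itself, i.e. the square of the root LLR cutoff."""
    return root_llr_threshold(confidence) ** 2
```

Under these assumptions, `llr_threshold(0.95)` comes out near 3.84, so a downsampler could keep only elements whose LLR exceeds that value instead of a fixed count per row.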
On Dec 14, 2015, at 8:04 PM, Nikaash Puri <nikaashpuri@gmail.com> wrote:
Hi,
Just wondering whether there is support for using root Log Likelihood Ratio
via some sort of flag in the cooccurrencesIDSs function
in org.apache.mahout.math.cf.SimilarityAnalysis. If not, I can create an
issue and work on adding that support.
Thank you,
Nikaash Puri
