# mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Log-likelihood ratio test as a probability
Date Thu, 20 Jun 2013 09:28:57 GMT
```
Yes, that should be all that's needed.
On Jun 20, 2013 10:27 AM, "Dan Filimon" <dangeorge.filimon@gmail.com> wrote:

> Right, makes sense. So, to normalize, I need to replace the counts in the
> matrix with probabilities.
> That is, I would divide everything by the sum of all the counts in the matrix?
>
>
> On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen <srowen@gmail.com> wrote:
>
> > I think the quickest answer is: the formula computes the test
> > statistic as a difference of log values, rather than as the log of a
> > ratio of values. Without normalizing, each entropy term is multiplied
> > by a factor (the sum of the counts) relative to the normalized version.
> > So you do end up with a statistic N times larger when the counts are
> > N times larger.
> >
> > On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon
> > <dangeorge.filimon@gmail.com> wrote:
> > > My understanding:
> > >
> > > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared
> > > distribution with 1 degree of freedom in the 2x2 table case.
> > >       A   ~A
> > > B
> > > ~B
> > >
> > > We're testing to see if p(A | B) = p(A | ~B). That's the null
> > > hypothesis. I compute the LLR. The larger that is, the more unlikely
> > > the null hypothesis is to be true.
> > > I can then look at a table with df=1. And I'd get p, the probability of
> > > seeing that result or something worse (the upper tail).
> > > So, the probability of them being similar is 1 - p (which is exactly
> > > the CDF for that value of X).
> > >
> > > Now, my question is: in the contingency table case, why would I
> > > normalize? It's a ratio already, isn't it?
> > >
> > >
> > > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <srowen@gmail.com> wrote:
> > >
> > >> Someone can check my facts here, but the log-likelihood ratio follows
> > >> a chi-square distribution. You can figure an actual probability from
> > >> that in the usual way, from its CDF. You would need to tweak the code
> > >> you see in the project to compute an actual LLR by normalizing the
> > >> input.
> > >>
> > >> You could use 1-p then as a similarity metric.
> > >>
> > >> This also isn't how the test statistic is turned into a similarity
> > >> metric in the project now. But 1-p sounds nicer. Maybe the historical
> > >> reason was speed, or ignorance.
> > >>
> > >> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon
> > >> <dangeorge.filimon@gmail.com> wrote:
> > >> > When computing item-item similarity using the log-likelihood
> > >> > similarity [1], can I simply apply a sigmoid to the resulting values
> > >> > to get the probability that two items are similar?
> > >> >
> > >> > Is there any other processing I need to do?
> > >> >
> > >> > Thanks!
> > >> >
> > >> > [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html
> > >>
> >
>

```
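The scaling argument in the thread can be checked numerically. The following is a minimal Python sketch, not the project's code: it assumes the standard G-statistic for a 2x2 table written in the entropy form, the counts are hypothetical, and the function names (`x_log_x`, `unnormalized_entropy`, `llr`, `chi2_cdf_df1`) are made up for illustration. For 1 degree of freedom the chi-squared CDF reduces to `erf(sqrt(x / 2))`, so only the standard library is needed.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def unnormalized_entropy(*values):
    """Return N * H(values / N), where N = sum(values) and H is Shannon entropy in nats."""
    return x_log_x(sum(values)) - sum(x_log_x(v) for v in values)


def llr(k11, k12, k21, k22):
    """G-statistic (-2 log lambda) for a 2x2 table, as a difference of entropy terms."""
    row = unnormalized_entropy(k11 + k12, k21 + k22)
    col = unnormalized_entropy(k11 + k21, k12 + k22)
    mat = unnormalized_entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def chi2_cdf_df1(x):
    """CDF of the chi-squared distribution with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))


# Hypothetical 2x2 contingency table: (A and B, ~A and B, A and ~B, ~A and ~B).
counts = (10, 20, 30, 400)
total = sum(counts)

g = llr(*counts)                              # statistic on raw counts
g10 = llr(*(10 * k for k in counts))          # same proportions, 10x the data
g_norm = llr(*(k / total for k in counts))    # counts replaced by probabilities

print("G on counts:          ", g)
print("G on 10x counts:      ", g10)
print("G on probabilities:   ", g_norm)
print("chi-squared CDF at G: ", chi2_cdf_df1(g))
```

With counts as input the statistic grows linearly with the total; with probabilities as input it equals the count-based value divided by the total. In the standard G-test it is the count-based statistic that is compared against the chi-squared distribution, which is why the sketch applies the CDF to `g`.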