From: Dan Filimon
Date: Thu, 20 Jun 2013 12:25:56 +0300
Message-ID:
Subject: Re: Log-likelihood ratio test as a probability
To: user@mahout.apache.org
Right, makes sense. So, by normalizing, I need to replace the counts in the
matrix with probabilities.
So, I would divide everything by the sum of all the counts in the matrix?
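To make the normalization question concrete, here is a minimal sketch in plain Python (the helper names are mine, not Mahout's code) of the LLR in the entropy form from the blog post linked at the bottom of this thread, plus the chi-squared conversion discussed below:

```python
import math

def entropy(counts):
    """Unnormalized entropy: -sum k * ln(k / N), where N = sum(counts).
    This equals N times the entropy of the normalized probabilities."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """-2 log lambda for a 2x2 contingency table (rows: B / ~B,
    columns: A / ~A), written as a difference of entropies."""
    row = entropy([k11 + k12, k21 + k22])
    col = entropy([k11 + k21, k12 + k22])
    mat = entropy([k11, k12, k21, k22])
    return 2.0 * (row + col - mat)

def llr_to_similarity(g):
    """1 - p, where p is the upper-tail chi-squared probability (df=1).
    For one degree of freedom, CDF(x) = erf(sqrt(x / 2))."""
    return math.erf(math.sqrt(g / 2.0))

# Scaling every count by a factor scales the statistic by that factor:
g1 = llr(10, 20, 30, 40)
g10 = llr(100, 200, 300, 400)  # ~10x g1
```

Dividing every count by the grand total N before computing the entropies would instead give an N-independent quantity (the empirical mutual information); multiplying that by 2N recovers the test statistic, which is why the unnormalized statistic grows with the counts.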
On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen wrote:
> I think the quickest answer is: the formula computes the test
> statistic as a difference of log values, rather than the log of a ratio
> of values. By not normalizing, the entropy is multiplied by a factor
> (the sum of the counts) relative to the normalized version. So you do
> end up with a statistic N times larger when the counts are N times
> larger.
>
> On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon wrote:
> > My understanding:
> >
> > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared
> > distribution with 1 degree of freedom in the 2x2 table case.
> >       A   ~A
> >  B
> > ~B
> >
> > We're testing to see if p(A | B) = p(A | ~B). That's the null
> > hypothesis. I compute the LLR. The larger that is, the less likely the
> > null hypothesis is to be true.
> > I can then look at a chi-squared table with df=1, and I'd get p, the
> > probability of seeing that result or something more extreme (the upper
> > tail).
> > So, the probability of them being similar is 1 - p (which is exactly
> > the CDF for that value of X).
> >
> > Now, my question is: in the contingency table case, why would I
> > normalize? It's a ratio already, isn't it?
> >
> >
> > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen wrote:
> >
> >> someone can check my facts here, but the log-likelihood ratio follows
> >> a chi-square distribution. You can figure an actual probability from
> >> that in the usual way, from its CDF. You would need to tweak the code
> >> you see in the project to compute an actual LLR by normalizing the
> >> input.
> >>
> >> You could use 1-p then as a similarity metric.
> >>
> >> This also isn't how the test statistic is turned into a similarity
> >> metric in the project now. But 1-p sounds nicer. Maybe the historical
> >> reason was speed, or ignorance.
> >>
> >> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon wrote:
> >> > When computing item-item similarity using the log-likelihood
> >> > similarity [1], can I simply apply a sigmoid to the resulting values
> >> > to get the probability that two items are similar?
> >> >
> >> > Is there any other processing I need to do?
> >> >
> >> > Thanks!
> >> >
> >> > [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html
> >>
>