From: "Prabhu"
Reply-To: user@mahout.apache.org
Subject: RE: Logistic Regression in Mahout
Date: Thu, 31 Jan 2013 08:45:12 +0530

Thanks, I thought of that, but that doesn't seem to be the right explanation either.

For one, in the output I see an equation like

TargetVariable ~ -0.001*InterceptTerm + -0.0006*predictor1 + -0.0004*predictor2 ....

Also, if I look at, say, predictor1, the coefficient in R is 1.02 and for predictor2 it is 0.48, whereas in Mahout I get -0.00063 for predictor1 and -0.00042 for predictor2. Now if these values are logs of what I am looking for, e^-0.00063 is 0.99937 and e^-0.00042 is 0.99958, so the difference is marginal, whereas the R coefficients indicate that predictor1 has a much higher weight than predictor2, which is what I would expect.

Any other thoughts, ideas?

Thanks
Prabhu

-----Original Message-----
From: Jake Mannix [mailto:jake.mannix@gmail.com]
Sent: 31 January 2013 04:54
To: user@mahout.apache.org
Subject: Re: Logistic Regression in Mahout

Looks like you're looking at weights which are logs of the weights you think you want.

On Wed, Jan 30, 2013 at 4:12 AM, Prabhu wrote:
> Hi all,
>
> I am trying to use Mahout to run logistic regression analysis on some data. The data is about 7 million rows, with about 20 predictor variables (all of them numeric). The target variable is Boolean: 0 or 1.
>
> I run a logistic regression with this data in R and I get good coefficients which make sense.
> But when I run a logistic regression on the exact same data using Mahout, I get coefficients that don't make sense. For a start, all coefficients are negative. The interesting thing is that the coefficient (from R) for the most important variable (with the highest coefficient) has the least negative value in Mahout. Can someone please help me understand what the cause of the problem is?
>
> Thanks
>
> Prabhu

--
-jake
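
The "logs of the weights" check discussed in the reply above can be reproduced with a few lines of Python. This is only a minimal sketch using the four coefficient values quoted in the thread; the variable and function names are illustrative assumptions and are not part of Mahout or R.

```python
import math

# Coefficients quoted in the thread for the two predictors.
r_coef = {"predictor1": 1.02, "predictor2": 0.48}
mahout_coef = {"predictor1": -0.00063, "predictor2": -0.00042}


def exponentiate(coefs):
    """Return e**value for every coefficient (hypothetical helper)."""
    return {name: math.exp(value) for name, value in coefs.items()}


# If the Mahout values were logs of the R values, exp() should recover them.
mahout_exp = exponentiate(mahout_coef)

for name in r_coef:
    print(f"{name}: R={r_coef[name]:.5f}  exp(Mahout)={mahout_exp[name]:.5f}")

# Relative weight of predictor1 vs predictor2 under each view.
r_ratio = r_coef["predictor1"] / r_coef["predictor2"]
exp_ratio = mahout_exp["predictor1"] / mahout_exp["predictor2"]
print(f"R ratio={r_ratio:.3f}  exp(Mahout) ratio={exp_ratio:.5f}")
```

With these inputs the exponentiated Mahout values both sit just below 1 and are nearly equal to each other, while the R coefficients differ by roughly a factor of two, which is the mismatch described in the reply.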