From user-return-14828-apmail-mahout-user-archive=mahout.apache.org@mahout.apache.org Mon Sep 3 19:54:26 2012
Return-Path:
X-Original-To: apmail-mahout-user-archive@www.apache.org
Delivered-To: apmail-mahout-user-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
by minotaur.apache.org (Postfix) with SMTP id DCE86DBD8
for ; Mon, 3 Sep 2012 19:54:26 +0000 (UTC)
Received: (qmail 88368 invoked by uid 500); 3 Sep 2012 19:54:25 -0000
Delivered-To: apmail-mahout-user-archive@mahout.apache.org
Received: (qmail 88255 invoked by uid 500); 3 Sep 2012 19:54:25 -0000
Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@mahout.apache.org
Delivered-To: mailing list user@mahout.apache.org
Received: (qmail 88244 invoked by uid 99); 3 Sep 2012 19:54:25 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 03 Sep 2012 19:54:25 +0000
X-ASF-Spam-Status: No, hits=1.5 required=5.0
tests=FSL_RCVD_USER,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS
X-Spam-Check-By: apache.org
Received-SPF: pass (athena.apache.org: domain of ted.dunning@gmail.com designates 74.125.82.50 as permitted sender)
Received: from [74.125.82.50] (HELO mail-wg0-f50.google.com) (74.125.82.50)
by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 03 Sep 2012 19:54:19 +0000
Received: by wgbds11 with SMTP id ds11so3250895wgb.7
for ; Mon, 03 Sep 2012 12:53:57 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=gmail.com; s=20120113;
h=mime-version:in-reply-to:references:from:date:message-id:subject:to
:content-type;
bh=7goCpbAW6YLnG1rxbVfOJCTtF1BCZvRpt7ZPAiVa4F0=;
b=0wzSYh/2gU2Pci5cBzcGq3KWTAcLJ/Pb1Q5kdnDCVs7KdcUhg6FU31IzoFyBwDnUKG
3Nz7uBM3ObXmra3EuX9sLxbQGGXcmWKR6G208DboP++9FMOoIfvdcMSGyt+3OmdKSmcn
P1i1/Bckqf3Q7t4kfPVNH4Nfb2+YtZh+6vKVT0v6bdRdgQebv9dk+23Xgdy9w8RxCiml
ydsDSVojiqBu9Gk0RVUr+AtWnzyVxPhSPA28Gp8m9+OwY5RqV/9y4m8Z7Z5lXTNb927R
dnKTUS8Qala1t0ss/ypnjPfXqivv8YnXsZHkFGaZgG2fKCdYBgQW6S2cxTWCQf/ugpwH
PZHw==
Received: by 10.180.86.133 with SMTP id p5mr25448482wiz.17.1346702037566; Mon,
03 Sep 2012 12:53:57 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.223.63.211 with HTTP; Mon, 3 Sep 2012 12:53:27 -0700 (PDT)
In-Reply-To:
References:
From: Ted Dunning
Date: Mon, 3 Sep 2012 15:53:27 -0400
Message-ID:
Subject: Re: SSVD Wrong Singular Vectors
To: user@mahout.apache.org
Content-Type: multipart/alternative; boundary=f46d0442820e91436504c8d18051
X-Virus-Checked: Checked by ClamAV on apache.org
--f46d0442820e91436504c8d18051
Content-Type: text/plain; charset=UTF-8
The dense random matrices should do somewhat better than the kind of sparse
trinary matrix.
On Mon, Sep 3, 2012 at 3:48 PM, Dmitriy Lyubimov wrote:
> There is a question still i guess that we never looked at which is the
> effect of different kind of distribution samplers for the Omega matrix
> (0-mean uniform vs. normal vs. trinary) on small problems though,
> although small matrices is not what Mahout is about.
>
> On Mon, Sep 3, 2012 at 12:23 PM, Dmitriy Lyubimov
> wrote:
> > ok so i ran Ahmed's numbers with optimal, R sequential version (ssvd.R
> > on the wiki) and actual MR. Bottom line, both R ssvd and MR are
> > producing quite a bit of rubbish for these k,p parameters and the long
> > tailed spectrum beyond just 1st singular vector. Obviously it is hard
> > to assert the errors since i'd need a lot of runs to compare and each
> > run creates different output.
> >
> > here we go (1st digits of first 4 singular vectors respectively)
> >
> > optimal
> > s$u[1,1:4]
> > [1] -0.09866006 -0.13195226 -0.16620985 -0.04921311
> >
> > ssvd.R
> > s <- ssvd.svd(A,k=30,p=2,qiter=1)
> >> ss$u[1,1:4]
> > [1] -0.09869577 -0.10368010 0.09543850 0.14122410
> >
> > MR SSVD
> >> U[1,1:4]
> > V2 V3 V4 V5
> > -0.09865305 -0.21822886 0.05959496 0.06561257
> >
> > Of course both versions of SSVD produce exact result with k+p=100
> > which means internal flow is correct. So i don't see necessary
> > discrepancies in the results between MR and ssvd.R in that fairly
> > corner case of things.
> >
> > So this just means that our choice of k,p and q is just too poor for
> > such a flat spectrum and such a small matrix beyond the first singular
> > vector. Can we do better? Sure. As mentioned, k=30,p=70 gives us exact
> > optimal solution:
> >
> >> ss <- ssvd.svd(A,k=30,p=70,qiter=0)
> >> ss$u[1,1:4]
> > [1] -0.09866006 0.13195226 0.16620985 -0.04921311
> >>
> >
> > So this is a rebuttal on the Ahmed's statement
> >> I tried multiple values for the parameters P and Q but, that does not
> seem to solve
> >> the problem.
> >
> > No, it kinda does as demonstrated.
> >
> > Next. Can we try something in between to work such a bad case to get a
> > better trade-off between errors and effort on precision side? sure.
> >
> > First, It doesn't make sense to use p<15. in fact p=15 is the default
> > one for MR version. So: if you ran MR version with default parameters,
> > it should have been something closer to this:
> >
> > (I'll just run algorithm several times for a more consistent
> > assessment of error mean and variance)
> >> ss <- ssvd.svd(A,k=30,p=15,qiter=1)
> >> ss$u[1,1:4]
> > [1] -0.09867691 -0.15782129 -0.14318662 -0.01915917
> > [1] -0.09868661 -0.16606151 -0.14881755 0.06277614
> > [1] -0.09865008 -0.18475458 0.10799953 0.03643756
> > [1] -0.09862877 0.16593100 0.15187904 -0.04660871
> > ...
> > So this default set of parameters is quite an improvement although we
> > probably still get enough error to the right hand of 4th singular
> > vector.
> > (i assume vector sign is not important, it means sign of V is flipped
> > as well i guess).
> >
> >
> > Finally, we can further try to improve with more power iterations.
> > ss <- ssvd.svd(A,k=30,p=15,qiter=2)
> > ss$u[1,1:4]
> >
> >
> >> ss <- ssvd.svd(A,k=30,p=15,qiter=2)
> >> ss$u[1,1:4]
> > [1] -0.09866002 -0.13100879 0.16691754 -0.04754525
> > [1] -0.09866011 -0.13027399 -0.16417364 -0.04453192
> > [1] -0.09866004 -0.13453556 -0.16704920 -0.04517045
> > [1] -0.09866006 -0.12905608 0.16230813 -0.04353147
> >
> > Now that adds even more stability in the 4th vector, doesn't it? The
> > error mean is demonstrably decreased.
> >
> > Bottom line points.
> >
> > First, there's not much point to investigate corner cases -- small
> > matrices, flat spectrum... You may force SSVD to produce exact
> > solution though, but it kind of defeats the purpose, because you can
> > as easily run full rank SVD on a matrix in matlab. Your parameters
> > chosen still produce acceptable results for the first vector where
> > spectrum is trendy but less so where the spectrum is flat.
> >
> > Second, you can control your errors through oversampling (p) and power
> > iterations (q) and you get much flexibility here. You balance
> > precision with flops tradeoff.
> >
> > Third, in real life it doesn't make sense to have high accuracy
> > measuring lfat spectrum. Flat spectrum just means you don't have
> > trends in those directions, i.e. essentially a random noise. If you
> > have random noise, direction of that noise is usually of little
> > interest, but because spectrum (i.e. singular values) is measured
> > better with the method, you just can state that you have mostly noise
> > after certain n-th singular vector and chances are in some cases you
> > just wouldn't care. And if you do care, you still have a choice to
> > work the oversampling and perhaps q iterations.
> >
> > Fourth, unless k+p=rank(A), each run will produce slightly different
> > results and errors so it doesn't make sense to make any error
> > comparison just between single runs of the variations. Instead, it
> > makes sense to compare error mean and variations on a better number of
> > runs.
> >
> > -d
> >
> > On Sun, Sep 2, 2012 at 12:00 AM, Ted Dunning
> wrote:
> >> Did Ahmed even use a power iteration?
> >>
> >> On Sun, Sep 2, 2012 at 1:35 AM, Dmitriy Lyubimov
> wrote:
> >>
> >>> but there is still a concern in a sense that power iterations
> >>> should've helped more than they did. I'll take a closer look but it
> >>> will take me a while to figure if there's something we can improve
> >>> here.
> >>>
>
--f46d0442820e91436504c8d18051--