Well, even with sparse data, your problem is probably still quite small for
this.
BTW, if I have time, I will probably put this method into Spark
RDDs and Bagel, which should speed things up by removing some
otherwise-unavoidable sorting overhead. In fact, methinks, having
Mahout sparse vectors and matrices as Spark RDDs would cure a lot
of the ills that sideline some of the current Mahout
implementations.
On Feb 19, 2013 10:48 AM, "K.D.P. Ross" <kdp@quixey.com> wrote:
> Just to follow up: I now have my real data, which is much
> sparser than the totally-random data … and, unsurprisingly,
> it exhibits a good bit more regularity, so it's compressible
> to the point that the on-disc SequenceFile is small enough
> that there's only a single map job, which, of course, means
> that the problem that I was experiencing doesn't arise at
> all.
>
> Incidentally, with the random data, I *did* get the same
> behaviour when I ran on a ‘real’ Hadoop cluster. (It's the
> full Hadoop stack, running on a single box.)
>
> On Thu, Feb 14, 2013 at 9:56 AM, K.D.P. Ross <kdp@quixey.com> wrote:
> > Appreciate the replies!
> >
> >> Yes, this problem has been pretty much beaten to shreds.
> >> In fact, so much so that I wrote it into the
> >> troubleshooting section (section 5) of the manual
> >> (
> https://cwiki.apache.org/confluence/download/attachments/27832158/SSVDCLI.pdf?version=17&modificationDate=1349999085000
> ).
> >
> > Aha, it looks like I had an out-of-date version of that
> > file! I grabbed it from here:
> >
> >
> https://cwiki.apache.org/MAHOUT/stochasticsingularvaluedecomposition.data/SSVDCLI.pdf
> >
> > linked to from this page:
> >
> >
> https://cwiki.apache.org/MAHOUT/stochasticsingularvaluedecomposition.html
> >
> > The FAQ section wasn't yet written, it looks like.
> >
> >> Perhaps I can suggest, as a first measure, running a
> >> simple local MR job on your file which just counts the
> >> number of rows in every map split. You should not see any
> >> split with fewer than k+p rows (110?). Since you are
> >> using local mode and not actual HDFS blocks, there may be
> >> some irregularities.
> >
> > Indeed, this was the problem: I saw that all but the last
> > split contained 889 rows … but that the final one was of
> > size 107. I tinkered with the parameters and this got me
> > sorted; specifically, I added the following to my ‘JobConf’:
> >
> > JobConf conf = new JobConf();
> > conf.setLong("mapred.min.split.size", 75570350L);
> >
> > where ‘75570350L’ was an empirically-derived ‘large-enough’
> > number. With that change made, the SSVD completed
> > successfully.
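
A rough way to pick that number, rather than deriving it empirically, would be to require each split to hold at least k+p rows: minSplitSize ≈ (k+p) × (fileBytes ÷ totalRows) × safetyFactor. A minimal sketch of that arithmetic, assuming roughly uniform row sizes on disc (the class name and the example numbers are hypothetical, not from the thread):

```java
// Hypothetical helper: estimate a mapred.min.split.size that keeps every
// split at or above k+p rows, assuming roughly uniform row size on disc.
public class SplitSizeEstimator {

    /**
     * @param fileBytes    total size of the SequenceFile on disc
     * @param totalRows    number of rows (vectors) in the file
     * @param kPlusP       decomposition rank k plus oversampling p
     * @param safetyFactor multiplier to absorb row-size variance (e.g. 2.0)
     * @return a minimum split size in bytes
     */
    public static long minSplitSizeBytes(long fileBytes, long totalRows,
                                         int kPlusP, double safetyFactor) {
        // Average bytes per row, scaled up to cover k+p rows plus headroom.
        double bytesPerRow = (double) fileBytes / totalRows;
        return (long) Math.ceil(kPlusP * bytesPerRow * safetyFactor);
    }

    public static void main(String[] args) {
        // Made-up numbers: a 1 GiB file with 1.5 M rows and k+p = 110.
        long min = minSplitSizeBytes(1L << 30, 1_500_000L, 110, 2.0);
        System.out.println(min);
    }
}
```

With a 2x safety factor, a short split can still slip through if row sizes vary a lot, so counting rows per split (as suggested above) remains the definitive check.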
> >
> >> Also, since random matrices exhibit just as much
> >> variance in every direction, random projection will not
> >> be able to reduce the problem efficiently (meaning the
> >> singular vectors of the final solution will be all over
> >> the place compared to the technically optimal solution).
> >> Tests on random matrices are not meaningful for
> >> precision-assessment purposes; only inputs with good
> >> spectrum decay are (as in tests). But it looks like many
> >> people are trying to do just that.
> >
> > Oh, right … I didn't have the real data available but wanted
> > to get some idea of the feasibility of using the Mahout SSVD
> > on input that was vaguely the right size … I didn't expect
> > anything meaningful to come out :~}
> >
> > I'm going to get the actual data ready and run it ‘for real’
> > now, which ought to produce something a bit more
> > interesting.
>
