mahout-user mailing list archives

From: Yehia Zakaria <y.zakaria@fci-cu.edu.eg>
Subject: Re: Mahout SSVD is too slow for highly dimensional data
Date: Wed, 12 Jun 2013 09:51:33 GMT
Hi Ted

No, it is a sparse matrix.


Thanks
Yehia


On Wed, Jun 12, 2013 at 11:21 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Is your input matrix dense?
>
>
> On Wed, Jun 12, 2013 at 9:54 AM, Yehia Zakaria <y.zakaria@fci-cu.edu.eg> wrote:
>
> > Thanks a lot, Ted and Dmitriy
> >
> > Keeping k = 100 solved the problem and Q-Job passed successfully.
> > Actually, I am evaluating Mahout SSVD performance, so I chose the rank to
> > be 1000 (which is 1% of the original number of attributes). But I
> > encountered another exception, in BtJob:
> >
> > java.io.IOException: Spill failed
> >     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1060)
> >     at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
> >     at java.io.DataOutputStream.writeDouble(DataOutputStream.java:242)
> >     at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:150)
> >     at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:80)
> >     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockWritable.write(SparseRowBlockWritable.java:81)
> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:100)
> >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:84)
> >     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:916)
> >     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:576)
> >     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:88)
> >     at org.a
> >
> > I searched for this issue and it seems to be an open issue in JIRA, but
> > I am not sure how closely it is related to my exception:
> >
> > https://issues.apache.org/jira/browse/MAPREDUCE-5028
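> >
> > If I read that JIRA correctly, MAPREDUCE-5028 is an integer overflow in
> > the map-side spill buffer that is triggered when io.sort.mb is set very
> > high, which would fit a "Spill failed" thrown from MapOutputBuffer. One
> > mitigation to try (the value below is illustrative, not tuned for this
> > cluster) is to keep the sort buffer modest in mapred-site.xml:
> >
> >   <property>
> >     <!-- map-side sort buffer size in MB; illustrative value -->
> >     <name>io.sort.mb</name>
> >     <value>256</value>
> >   </property>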
> >
> > Thanks
> > Yehia
> >
> >
> >
> >
> > On Tue, Jun 11, 2013 at 10:43 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > >
> > > What Ted said. k+p=1001 will make the per-task running time quite long.
> > > Actually, I don't think anyone has attempted that many values, so I
> > > don't even have a sense of how long it will take. It should still be
> > > CPU-bound, though, regardless.
> > >
> > > A much better trade-off is to have fewer values but more precision in
> > > them by using a power iteration (-q 1). The power-iteration step (ABt)
> > > will definitely have a hard time multiplying with k=1000, just because
> > > of the amount of data to move around and sort.
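> > >
> > > For what it's worth, here is a minimal sketch of such an invocation.
> > > Paths and the reducer count are hypothetical, and the exact flag
> > > spellings should be checked against "mahout ssvd --help" on your build:
> > >
> > >   mahout ssvd \
> > >     -i /path/to/input-matrix \
> > >     -o /path/to/ssvd-output \
> > >     -k 100 -p 15 -q 1 \
> > >     --pca true \
> > >     -t 16
> > >
> > > The idea is that -q 1 buys accuracy in the leading 100 singular values
> > > far more cheaply than raising k to 1000 would.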
> > >
> > >
> > > On Tue, Jun 11, 2013 at 12:48 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >
> > > > Don't do that.
> > > >
> > > > Why do you think you need 1000 singular values?
> > > >
> > > > Have you tried with k=100, p=15?
> > > >
> > > > Quite seriously, I would expect that you would literally get just as
> > > > good results for almost any real application with 100 singular vectors
> > > > and 900 orthogonal noise vectors.
> > > >
> > > >
> > > > On Tue, Jun 11, 2013 at 9:39 AM, Yehia Zakaria <y.zakaria@fci-cu.edu.eg> wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > The requested rank (k) is 1000 and p is 1. The input size is 1.2 GB.
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jun 10, 2013 at 9:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > >
> > > > > > What is the requested rank? This guy will not scale w.r.t. rank,
> > > > > > only w.r.t. input size. Realistically, you don't need k > 100,
> > > > > > p > 15.
> > > > > >
> > > > > > What is the input size (A, in GB)?
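> > > > > >
> > > > > > Rough arithmetic, just to size the effect of rank: in SSVD every
> > > > > > input row of A becomes a dense row of Y = A*Omega with k+p entries.
> > > > > > At k+p = 1001, each of the 8,000,000 rows carries about 1,000
> > > > > > doubles, which is roughly 64 GB of dense intermediate data to write
> > > > > > and shuffle, versus about 7 GB at k = 100, p = 15.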
> > > > > >
> > > > > >
> > > > > > On Mon, Jun 10, 2013 at 5:31 AM, Yahia Zakaria <yahiawestlife@gmail.com> wrote:
> > > > > >
> > > > > > > Hi All
> > > > > > >
> > > > > > > I am running Mahout SSVD (trunk version) with the pca option on
> > > > > > > the Bag of Words dataset
> > > > > > > (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words). This
> > > > > > > dataset has 8,000,000 instances (rows) and 100,000 attributes
> > > > > > > (columns). Mahout SSVD is too slow; it may take days to finish
> > > > > > > the first phase of SSVD (Q-Job). I am running the code on a
> > > > > > > cluster of 16 machines, each with 8 cores and 32 GB of memory.
> > > > > > > Moreover, the CPU and memory of the workers are not utilized at
> > > > > > > all. When running Mahout SSVD on a smaller dataset (12,500 rows
> > > > > > > and 5,000 columns), it runs very fast; the job finished in 2
> > > > > > > minutes. Do you have any idea why Mahout SSVD is so slow for
> > > > > > > high-dimensional data, and to what extent SSVD can work
> > > > > > > efficiently (with respect to the number of rows and columns of
> > > > > > > the input matrix)?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Yehia
> > > > > >
> > > > >
> > > >
> >
>
