mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Mahout SSVD is too slow for highly dimensional data
Date Wed, 12 Jun 2013 14:13:53 GMT
Yes, somebody mentioned this just a week ago on this list too, and we even
had a Mahout issue filed. Yes, this is Hadoop-specific; there's a bug that
limits the usable sort buffer size, per the description. Keep sort buffers
small, or upgrade (or downgrade, if you feel like it). As usual with Hadoop,
one operational bug fix seems to come with a couple of new ones.
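
For example, something like this (an untested sketch: both the old and the
new spelling of the map-side sort buffer property are shown, and it assumes
the ssvd driver passes generic -D options through ToolRunner; otherwise set
the same properties in mapred-site.xml):

  # keep the map-side sort buffer small to stay clear of MAPREDUCE-5028
  mahout ssvd \
    -Dio.sort.mb=100 \
    -Dmapreduce.task.io.sort.mb=100 \
    <rest of your usual ssvd arguments>
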
On Jun 12, 2013 12:55 AM, "Yehia Zakaria" <y.zakaria@fci-cu.edu.eg> wrote:

> Thanks a lot, Ted and Dmitriy.
>
> Keeping k = 100 solved the problem and Q-Job passed successfully. Actually
> I am evaluating Mahout SSVD performance, so I have chosen the rank to be
> 1000 (which is 1% of the original number of attributes). But I encountered
> another exception in BtJob:
>
> java.io.IOException: Spill failed
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1060)
>     at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
>     at java.io.DataOutputStream.writeDouble(DataOutputStream.java:242)
>     at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:150)
>     at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:80)
>     at org.apache.mahout.math.hadoop.stochasticsvd.SparseRowBlockWritable.write(SparseRowBlockWritable.java:81)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:100)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:84)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:916)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:576)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:88)
>     at org.a
>
> I searched for this issue and it seems to be an open issue on JIRA, but I
> am not sure how closely it is related to my exception:
>
> https://issues.apache.org/jira/browse/MAPREDUCE-5028
>
> Thanks
> Yehia
>
>
>
>
> On Tue, Jun 11, 2013 at 10:43 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
> >
> > What Ted said. k+p=1001 will increase per-task running time quite a bit.
> > Actually, I don't think anyone has attempted that many values, so I don't
> > even have a sense of how long it will take. It should still be CPU-bound,
> > though, regardless.
> >
> > A much better trade-off is to have fewer values but more precision in
> > them, with a power iteration (-q 1). The power iteration step (ABt) will
> > definitely have a hard time multiplying with k=1000, just because of the
> > amount of data to move around and sort.
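
For concreteness, a sketch of that trade-off as a command line (hypothetical
paths; flag names as in the current ssvd driver's usage, so treat them as an
assumption). At k+p=1001, Q alone is roughly 8e6 x 1001 x 8 bytes, i.e. about
64 GB of dense doubles to shuffle, versus about 7 GB at k+p=115:

  # fewer singular values, one power iteration (-q 1) to recover accuracy
  mahout ssvd \
    --input /path/to/bow-matrix \
    --output /path/to/ssvd-out \
    --pca true \
    -k 100 -p 15 -q 1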
> >
> >
> > > On Tue, Jun 11, 2013 at 12:48 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > > Don't do that.
> > >
> > > Why do you think you need 1000 singular values?
> > >
> > > Have you tried with k=100, p=15?
> > >
> > > Quite seriously, I would expect that you would literally get just as
> > > good results for almost any real application with 100 singular vectors
> > > and 900 orthogonal noise vectors.
> > >
> > >
> > > On Tue, Jun 11, 2013 at 9:39 AM, Yehia Zakaria <y.zakaria@fci-cu.edu.eg> wrote:
> > >
> > > > Hi
> > > >
> > > > The requested rank (k) is 1000 and p is 1. The input size is 1.2 gigabytes.
> > > >
> > > > Thanks
> > > >
> > > >
> > > >
> > > > On Mon, Jun 10, 2013 at 9:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > >
> > > > > What is the requested rank? This guy will not scale w.r.t. rank,
> > > > > only w.r.t. input size. Realistically you don't need k > 100, p > 15.
> > > > >
> > > > > What is the input size (A in GB)?
> > > > >
> > > > >
> > > > > On Mon, Jun 10, 2013 at 5:31 AM, Yahia Zakaria <yahiawestlife@gmail.com> wrote:
> > > > >
> > > > > > Hi All
> > > > > >
> > > > > > I am running Mahout SSVD (trunk version) using the pca option on
> > > > > > the Bag of Words dataset
> > > > > > (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words). This dataset
> > > > > > has 8,000,000 instances (rows) and 100,000 attributes (columns).
> > > > > > Mahout SSVD is too slow; it may take days to finish the first phase
> > > > > > of SSVD (Q-Job). I am running the code on a cluster of 16 machines,
> > > > > > each with 8 cores and 32 GB of memory. Moreover, the CPU and memory
> > > > > > of the workers are not utilized at all. When running Mahout SSVD on
> > > > > > a smaller dataset (12,500 rows and 5,000 columns), it runs very
> > > > > > fast; the job finished in 2 minutes. Do you have any idea why
> > > > > > Mahout SSVD is so slow for high-dimensional data, and to what
> > > > > > extent SSVD can work efficiently (with respect to the number of
> > > > > > rows and columns of the input matrix)?
> > > > > >
> > > > > > Thanks
> > > > > > Yehia
> > > > > >
> > > > >
> > > >
> > >
>
