mahout-user mailing list archives

From Danny Bickson <danny.bick...@gmail.com>
Subject Re: is it possible to compute the SVD for a large scale matrix
Date Fri, 08 Apr 2011 11:06:05 GMT
Now try increasing the heap size in conf/hadoop-env.sh. For example:

# heap, in MB, for JVMs launched via bin/hadoop (including the client)
export HADOOP_HEAPSIZE=4000
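
(The value is in megabytes, so 4000 gives roughly a 4 GB heap. Your
stack trace shows the OutOfMemoryError in the "main" thread inside
LanczosSolver.solve, i.e. in the driver: the Lanczos basis vectors are
held in memory on the client, and the client JVM started by bin/hadoop
is sized by HADOOP_HEAPSIZE, while mapred.child.java.opts only applies
to the map/reduce task JVMs. Rough arithmetic: each 600,000-dimensional
basis vector costs up to 600,000 * 8 bytes ~= 4.8 MB once it fills in,
so a rank-r decomposition holds on the order of r * 5 MB of basis
alone, plus the hash-map overhead visible in the trace.)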

- Danny Bickson

On Thu, Apr 7, 2011 at 10:32 PM, Wei Li <wei.lee04@gmail.com> wrote:

>
> Hi Danny and All:
>
>     I have increased the JVM memory via mapred.child.java.opts, but it
> still fails after 2 or 3 passes through the corpus.
>
>     And the matrix dimension is about 600,000 * 600,000. The error is as
> follows:
>
>     Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at
> org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
> at
> org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
> at
> org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
> at
> org.apache.mahout.math.RandomAccessSparseVector.assign(RandomAccessSparseVector.java:106)
> at
> org.apache.mahout.math.SparseRowMatrix.assignRow(SparseRowMatrix.java:148)
> at
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:134)
> at
> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:177)
> at
> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:110)
> at
> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:253)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at
> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:259)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
>
> Best
> Wei
>
> On Thu, Apr 7, 2011 at 7:59 AM, Wei Li <wei.lee04@gmail.com> wrote:
>
>> Hi All:
>>
>> sorry for the misunderstanding, the dimension is about 600,000 * 600,000 :)
>>
>> Best
>> Wei
>>
>>
>> On Wed, Apr 6, 2011 at 6:53 PM, Danny Bickson <danny.bickson@gmail.com> wrote:
>>
>>> Hi.
>>> Do you mean 60 million by 60 million? I guess this may be rather big
>>> for Mahout.
>>> Another option you have is to try GraphLab: see
>>> http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html
>>> I will be happy to give you support in case you would like to try
>>> GraphLab.
>>>
>>> Best,
>>>
>>> DB
>>>
>>>
>>> On Wed, Apr 6, 2011 at 2:13 AM, Wei Li <wei.lee04@gmail.com> wrote:
>>>
>>>> Hi Danny:
>>>>
>>>>      I have transformed the CSV data into the DistributedRowMatrix
>>>> format, but it still fails due to the memory problem after 2 or 3
>>>> iterations.
>>>>
>>>>      my matrix dimension is about 60w * 60w (600,000 * 600,000). Is it
>>>> possible to do the SVD decomposition at this scale using Mahout?
>>>>
>>>> Best
>>>> Wei
>>>>
>>>>
>>>> On Sat, Mar 26, 2011 at 1:43 AM, Danny Bickson <danny.bickson@gmail.com> wrote:
>>>>
>>>>> Hi Wei,
>>>>> You must verify that you use a SPARSE matrix and not a dense one, or
>>>>> else you will surely run out of memory.
>>>>> Take a look at this example on how to prepare the input:
>>>>> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>>>>>
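>>>>> Back of the envelope: a dense 600,000 x 600,000 matrix of doubles
>>>>> is 600,000^2 * 8 bytes ~= 2.9 TB, which can never fit in a JVM
>>>>> heap, while a sparse matrix only pays for its nonzero entries.
>>>>>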
>>>>> Best,
>>>>>
>>>>> Danny Bickson
>>>>>
>>>>>
>>>>> On Fri, Mar 25, 2011 at 1:33 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>>
>>>>>> Wei,
>>>>>>
>>>>>> 1) I think DenseMatrix is a RAM-only representation. Naturally, you
>>>>>> get OOM because it all has to fit in memory. If you want to run a
>>>>>> RAM-only SVD computation, you perhaps don't need Mahout. If you want
>>>>>> to run distributed SVD computations, you need to prepare your data in
>>>>>> what is called DistributedRowMatrix format. This is a sequence file
>>>>>> with keys being whatever key you need to identify your rows, and
>>>>>> values being VectorWritable wrapping any of the vector implementations
>>>>>> found in Mahout (dense, sparse sequential, sparse random).
>>>>>> 2) Once you've prepared your data in DRM format, you can run either of
>>>>>> the SVD algorithms found in Mahout. It can be the Lanczos solver
>>>>>> ("mahout svd ...") or, on the trunk, you can also find a stochastic
>>>>>> SVD method ("mahout ssvd ..."), which is issue MAHOUT-593 I mentioned
>>>>>> earlier; see the sketch below.
>>>>>>
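>>>>>> For illustration only -- the 600,000 row length comes from your
>>>>>> mails, the class name and output path are hypothetical, and this is
>>>>>> a sketch against the 0.20-era Hadoop and current Mahout APIs rather
>>>>>> than a tested recipe -- writing a DRM sequence file could look like:
>>>>>>
>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>>> import org.apache.hadoop.fs.Path;
>>>>>> import org.apache.hadoop.io.IntWritable;
>>>>>> import org.apache.hadoop.io.SequenceFile;
>>>>>> import org.apache.mahout.math.RandomAccessSparseVector;
>>>>>> import org.apache.mahout.math.Vector;
>>>>>> import org.apache.mahout.math.VectorWritable;
>>>>>>
>>>>>> public class DrmWriteSketch {
>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>     Configuration conf = new Configuration();
>>>>>>     Path out = new Path("/tmp/drm/part-00000"); // hypothetical path
>>>>>>     // keys identify rows; values wrap the row vectors
>>>>>>     SequenceFile.Writer w = SequenceFile.createWriter(
>>>>>>         FileSystem.get(conf), conf, out,
>>>>>>         IntWritable.class, VectorWritable.class);
>>>>>>     // a 600,000-wide sparse row; only the nonzeros consume memory
>>>>>>     Vector row = new RandomAccessSparseVector(600000);
>>>>>>     row.setQuick(42, 1.0);
>>>>>>     w.append(new IntWritable(0), new VectorWritable(row));
>>>>>>     w.close();
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Then the solver would be invoked along the lines of (flag names
>>>>>> approximate -- check "mahout svd --help" on your version):
>>>>>>
>>>>>> mahout svd --input /tmp/drm --output /tmp/svd-out \
>>>>>>   --numRows 600000 --numCols 600000 --rank 100
>>>>>>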
>>>>>> Either way, I am not sure why you want DenseMatrix unless you want to
>>>>>> use the RAM-only Colt SVD solver -- but you certainly don't have to
>>>>>> focus on the Mahout implementation of one if you just want a RAM
>>>>>> solver.
>>>>>>
>>>>>> -d
>>>>>>
>>>>>> On Fri, Mar 25, 2011 at 3:25 AM, Wei Li <wei.lee04@gmail.com> wrote:
>>>>>> >
>>>>>> > Actually, I would like to perform spectral clustering on a
>>>>>> > large-scale sparse matrix, but it failed due to the OutOfMemory
>>>>>> > error when creating the DenseMatrix for the SVD decomposition.
>>>>>> >
>>>>>> > Best
>>>>>> > Wei
>>>>>> >
>>>>>> > On Fri, Mar 25, 2011 at 4:05 PM, Dmitriy Lyubimov
>>>>>> > <dlieu.7@gmail.com> wrote:
>>>>>> >>
>>>>>> >> SSVD != Lanczos. If you do PCA or LSI, it is perhaps what you
>>>>>> >> need. It can take on these things. Well, at least some of my
>>>>>> >> branches can, if not the official patch.
>>>>>> >>
>>>>>> >> -d
>>>>>> >>
>>>>>> >> On Thu, Mar 24, 2011 at 11:09 PM, Wei Li <wei.lee04@gmail.com> wrote:
>>>>>> >> >
>>>>>> >> > thanks for your reply
>>>>>> >> >
>>>>>> >> > my matrix is not very dense; it is a sparse matrix.
>>>>>> >> >
>>>>>> >> > I have tried Mahout's svd, but it failed due to the OutOfMemory
>>>>>> >> > error.
>>>>>> >> >
>>>>>> >> > Best
>>>>>> >> > Wei
>>>>>> >> >
>>>>>> >> > On Fri, Mar 25, 2011 at 2:03 PM, Dmitriy Lyubimov
>>>>>> >> > <dlieu.7@gmail.com> wrote:
>>>>>> >> >>
>>>>>> >> >> You can certainly try to write it out into a DRM (distributed
>>>>>> >> >> row matrix) and run stochastic SVD on Hadoop (off the trunk
>>>>>> >> >> now); see MAHOUT-593. This is suitable if you have a good decay
>>>>>> >> >> of singular values (but if you don't, it probably just means
>>>>>> >> >> you have so much noise that it masks the problem you are trying
>>>>>> >> >> to solve in your data).
>>>>>> >> >>
>>>>>> >> >> The currently committed solution is not the most efficient yet,
>>>>>> >> >> but it should be quite capable.
>>>>>> >> >>
>>>>>> >> >> If you do, let me know how it went.
>>>>>> >> >>
>>>>>> >> >> thanks.
>>>>>> >> >> -d
>>>>>> >> >>
>>>>>> >> >> On Thu, Mar 24, 2011 at 10:59 PM, Dmitriy Lyubimov
>>>>>> >> >> <dlieu.7@gmail.com> wrote:
>>>>>> >> >> > Are you sure your matrix is dense?
>>>>>> >> >> >
>>>>>> >> >> > On Thu, Mar 24, 2011 at 9:59 PM, Wei Li <wei.lee04@gmail.com> wrote:
>>>>>> >> >> >> Hi All:
>>>>>> >> >> >>
>>>>>> >> >> >>    is it possible to compute the SVD factorization for a
>>>>>> >> >> >> 600,000 * 600,000 matrix using Mahout?
>>>>>> >> >> >>
>>>>>> >> >> >>    I have got the OutOfMemory error when creating the
>>>>>> >> >> >> DenseMatrix.
>>>>>> >> >> >>
>>>>>> >> >> >> Best
>>>>>> >> >> >> Wei
>>>>>> >> >> >>
>>>>>> >> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
