mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: is it possible to compute the SVD for a large scale matrix
Date Fri, 08 Apr 2011 13:57:36 GMT
I don't think that is going to remedy his condition. He is hitting the OOM in
the driver, and hadoop-env.sh controls heap for the tasktracker and such (not
even child task memory). He needs more memory in the frontend, which is indeed
the bottleneck for this right now.

apologies for brevity.

Sent from my android.
-Dmitriy
On Apr 8, 2011 4:06 AM, "Danny Bickson" <danny.bickson@gmail.com> wrote:
> Now try to increase the heap size in the file conf/hadoop-env.sh.
> For example:
>
> HADOOP_HEAPSIZE=4000
>
> - Danny Bickson
>
> On Thu, Apr 7, 2011 at 10:32 PM, Wei Li <wei.lee04@gmail.com> wrote:
>
>>
>> Hi Danny and All:
>>
>> I have increased the JVM memory (mapred.child.java.opts), but it still
>> failed after 2 or 3 passes through the corpus.
>>
>> The matrix dimension is about 600,000 * 600,000; the error info is as
>> follows:
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
>> at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
>> at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
>> at org.apache.mahout.math.RandomAccessSparseVector.assign(RandomAccessSparseVector.java:106)
>> at org.apache.mahout.math.SparseRowMatrix.assignRow(SparseRowMatrix.java:148)
>> at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:134)
>> at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:177)
>> at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:110)
>> at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:253)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:259)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>
>> Best
>> Wei
>>
>> On Thu, Apr 7, 2011 at 7:59 AM, Wei Li <wei.lee04@gmail.com> wrote:
>>
>>> Hi All:
>>>
>>> sorry for misunderstanding, the dimension is about 600,000 * 600,000 :)
>>>
>>> Best
>>> Wei
>>>
>>>
>>> On Wed, Apr 6, 2011 at 6:53 PM, Danny Bickson <danny.bickson@gmail.com> wrote:
>>>
>>>> Hi.
>>>> Do you mean 60 million by 60 million? I guess this may be rather big
>>>> for Mahout.
>>>> Another option you have is to try GraphLab: see
>>>> http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html
>>>> I will be happy to give you support in case you would like to try
>>>> GraphLab.
>>>>
>>>> Best,
>>>>
>>>> DB
>>>>
>>>>
>>>> On Wed, Apr 6, 2011 at 2:13 AM, Wei Li <wei.lee04@gmail.com> wrote:
>>>>
>>>>> Hi Danny:
>>>>>
>>>>> I have transformed the csv data into the DistributedRowMatrix
>>>>> format, but it still failed due to the memory problem after 2 or 3
>>>>> iterations.
>>>>>
>>>>> my matrix dimension is about 60w * 60w; is it possible to do the SVD
>>>>> decomposition at this scale using Mahout?
>>>>>
>>>>> Best
>>>>> Wei
>>>>>
>>>>>
>>>>> On Sat, Mar 26, 2011 at 1:43 AM, Danny Bickson <danny.bickson@gmail.com> wrote:
>>>>>
>>>>>> Hi Wei,
>>>>>> You must verify that you use a SPARSE matrix and not a dense one, or else
>>>>>> you will surely get out of memory.
>>>>>> Take a look at this example:
>>>>>> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>>>>>> on how to prepare the input.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Danny Bickson
>>>>>>
>>>>>>
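To make the sparse-vs-dense point above concrete, here is a small illustrative
Java sketch (the sizes and indices are invented; only the arithmetic matters): a
dense 600,000 * 600,000 matrix of doubles needs about 600,000 * 600,000 * 8
bytes, roughly 2.9 TB of heap, while a Mahout sparse vector only stores its
non-zero entries.

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseVsDense {
  public static void main(String[] args) {
    long n = 600000L;
    // Heap a dense 600,000 x 600,000 matrix of doubles would need for its
    // cell values alone: n * n * 8 bytes ~= 2.88e12 bytes (~2.6 TiB).
    System.out.printf("dense cells need ~%.2e bytes%n", (double) (n * n * 8L));

    // A sparse row of the same dimensionality only pays for its non-zeros.
    Vector row = new RandomAccessSparseVector((int) n);
    row.setQuick(12345, 1.0);
    row.setQuick(424242, 0.5);
    System.out.println("non-zeros stored: " + row.getNumNondefaultElements());
  }
}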
>>>>>> On Fri, Mar 25, 2011 at 1:33 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>>>
>>>>>>> Wei,
>>>>>>>
>>>>>>> 1) I think DenseMatrix is a RAM-only representation. Naturally, you
>>>>>>> get OOM because it all has to fit in memory. If you want to run a
>>>>>>> RAM-only SVD computation, you perhaps don't need Mahout. If you want
>>>>>>> to run distributed SVD computations, you need to prepare your data in
>>>>>>> what is called DistributedRowMatrix format. This is a sequence file
>>>>>>> with keys being whatever key you need to identify your rows, and
>>>>>>> values being VectorWritable wrapping any of the vector implementations
>>>>>>> found in mahout (dense, sparse sequential, sparse random).
>>>>>>> 2) Once you've prepared your data in DRM format, you can run either of
>>>>>>> the SVD algorithms found in Mahout. It can be the Lanczos solver
>>>>>>> ("mahout svd ...") or, on the trunk, you can also find a stochastic SVD
>>>>>>> method ("mahout ssvd ...") which is issue MAHOUT-593 I mentioned earlier.
>>>>>>>
>>>>>>> Either way, I am not sure why you want DenseMatrix unless you want to
>>>>>>> use the RAM-only Colt SVD solver -- but you certainly don't have to
>>>>>>> focus on the Mahout implementation of one if you just want a RAM solver.
>>>>>>>
>>>>>>> -d
>>>>>>>
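A rough end-to-end sketch of points 1) and 2) above, with invented local paths,
toy dimensions, and Lanczos option names recalled from 0.5-era code (the flag
names are assumptions, not confirmed by this thread; check the job's --help for
your Mahout version). The DRM input is just a SequenceFile of IntWritable row
ids and VectorWritable sparse rows, which the distributed Lanczos job then
consumes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

public class DrmSvdSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path drm = new Path("drm/matrix.seq");   // hypothetical output path

    // 1) DistributedRowMatrix input: SequenceFile<IntWritable, VectorWritable>,
    //    one row per record; sparse vectors, so only non-zeros are stored.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, drm, IntWritable.class, VectorWritable.class);
    try {
      int numCols = 600000;
      for (int row = 0; row < 3; row++) {            // a few toy rows
        Vector v = new RandomAccessSparseVector(numCols);
        v.setQuick(row, 1.0);
        v.setQuick((row * 31) % numCols, 0.5);
        writer.append(new IntWritable(row), new VectorWritable(v));
      }
    } finally {
      writer.close();
    }

    // 2) Run the distributed Lanczos job on it, as "mahout svd ..." would.
    //    These option names are assumptions recalled from memory.
    DistributedLanczosSolver.main(new String[] {
        "--input", "drm/matrix.seq",
        "--output", "svd/out",
        "--numRows", "3",
        "--numCols", "600000",
        "--rank", "2"
    });
  }
}

Note that the Lanczos basis is accumulated in the driver (see the stack trace
earlier in the thread), so for a 600,000 * 600,000 input the client JVM, not
just mapred.child.java.opts, needs the extra heap.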
>>>>>>> On Fri, Mar 25, 2011 at 3:25 AM, Wei Li <wei.lee04@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Actually, I would like to perform spectral clustering on a large-scale
>>>>>>> > sparse matrix, but it failed due to an OutOfMemory error when creating
>>>>>>> > the DenseMatrix for the SVD decomposition.
>>>>>>> >
>>>>>>> > Best
>>>>>>> > Wei
>>>>>>> >
>>>>>>> > On Fri, Mar 25, 2011 at 4:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> SSVD != Lanczos. If you do PCA or LSI, it is perhaps what you need.
>>>>>>> >> It can take on these things. Well, at least some of my branches can,
>>>>>>> >> if not the official patch.
>>>>>>> >>
>>>>>>> >> -d
>>>>>>> >>
>>>>>>> >> On Thu, Mar 24, 2011 at 11:09 PM, Wei Li <wei.lee04@gmail.com> wrote:
>>>>>>> >> >
>>>>>>> >> > thanks for your reply
>>>>>>> >> >
>>>>>>> >> > my matrix is not very dense, a sparse matrix.
>>>>>>> >> >
>>>>>>> >> > I have tried the svd of Mahout, but failed due to the OutOfMemory error.
>>>>>>> >> >
>>>>>>> >> > Best
>>>>>>> >> > Wei
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > On Fri, Mar 25, 2011 at 2:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>>>> >> >>
>>>>>>> >> >> You can certainly try to write it out into a DRM (distributed row
>>>>>>> >> >> matrix) and run stochastic SVD on hadoop (off the trunk now); see
>>>>>>> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
>>>>>>> >> >> values (but if you don't, it probably just means you have so much
>>>>>>> >> >> noise that it masks the problem you are trying to solve in your data).
>>>>>>> >> >>
>>>>>>> >> >> The current committed solution is not the most efficient yet, but it
>>>>>>> >> >> should be quite capable.
>>>>>>> >> >>
>>>>>>> >> >> If you do, let me know how it went.
>>>>>>> >> >>
>>>>>>> >> >> thanks.
>>>>>>> >> >> -d
>>>>>>> >> >>
>>>>>>> >> >> On Thu, Mar 24, 2011 at 10:59 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>>>> >> >> > Are you sure your matrix is dense?
>>>>>>> >> >> >
>>>>>>> >> >> > On Thu, Mar 24, 2011 at 9:59 PM, Wei Li <wei.lee04@gmail.com> wrote:
>>>>>>> >> >> >> Hi All:
>>>>>>> >> >> >>
>>>>>>> >> >> >> is it possible to compute the SVD factorization for a 600,000 *
>>>>>>> >> >> >> 600,000 matrix using Mahout?
>>>>>>> >> >> >>
>>>>>>> >> >> >> I have got the OutOfMemory error when creating the DenseMatrix.
>>>>>>> >> >> >>
>>>>>>> >> >> >> Best
>>>>>>> >> >> >> Wei
>>>>>>> >> >> >>
>>>>>>> >> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
