mahout-user mailing list archives

From Kris Jack <mrkrisj...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Mon, 28 Jun 2010 15:18:07 GMT
Hi,

I am now using the version of
org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian
wrote and that has been added to the trunk.  Thanks again for that!  I can
generate an output file that should contain a list of documents with their
top 100 most similar documents.  However, I am having problems converting
the output file into a readable format using Mahout's vectordump:

$ ./mahout vectordump --seqFile similarRows --output results.out --printKey
no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
Input Path: /home/kris/similarRows
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
    at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
    at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)

What is vectordump doing that takes up so much memory?  A file is produced
with 37,952 readable rows, but I'm expecting more like 500,000 results,
since that is how many documents I have.  Should I be using something else
to read the output file of the RowSimilarityJob?

Thanks,
Kris



2010/6/18 Sebastian Schelter <ssc.open@googlemail.com>

> Hi Kris,
>
> maybe you want to give the patch from
> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not yet
> tested it with larger data, but I would be happy to get some
> feedback on it, and maybe it helps you with your use case.
>
> -sebastian
>
> On 18.06.2010 18:46, Kris Jack wrote:
> > Thanks Ted,
> >
> > I got that working.  Unfortunately, the matrix multiplication job is
> > taking far longer than I hoped.  With just over 10 million documents,
> > 10 mappers and 10 reducers, I can't get it to complete the job in under
> > 48 hours.
> >
> > Perhaps you have an idea for speeding it up?  I have already been quite
> > ruthless with making the vectors sparse.  I did not include terms that
> > appeared in over 1% of the corpus and only kept terms that appeared at
> > least 50 times.  Is it normal that the matrix multiplication map reduce
> > task should take so long to process with this quantity of data and
> > resources available or do you think that my system is not configured
> > properly?
> >
> > Thanks,
> > Kris
> >
> >
> >
> > 2010/6/15 Ted Dunning <ted.dunning@gmail.com>
> >
> >
> >> Thresholds are generally dangerous.  It is usually preferable to specify
> >> the sparseness you want (1%, 0.2%, whatever), sort the results in
> >> descending score order using Hadoop's built-in capabilities and just
> >> drop the rest.
> >>
> >> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
> >>
> >>
> >>> I was wondering if there was an interesting way to do this with the
> >>> current mahout code such as requesting that the Vector accumulator
> >>> returns only elements that have values greater than a given threshold,
> >>> sorting the vector by value rather than key, or something else?
> >>>
> >>>
> >>
> >
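The pruning Kris describes in the quoted message above (leave out terms that appear in over 1% of the corpus, keep only terms that appear at least 50 times) could look roughly like the sketch below. It is plain Java, not Mahout code. The class TermPruner and the method keptTerms are hypothetical names, and two readings of the description are assumed: the 1% limit is taken as a fraction of documents, and the minimum of 50 as total occurrences across the corpus.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout.
final class TermPruner {

    /**
     * Returns the terms that occur at least minCount times in total
     * (assumed reading of "appeared at least 50 times") and that occur in
     * at most maxDocFraction of the documents (assumed reading of
     * "appeared in over 1% of the corpus").
     */
    static Set<String> keptTerms(List<List<String>> docs,
                                 double maxDocFraction, int minCount) {
        Map<String, Integer> docFreq = new HashMap<>();    // documents containing the term
        Map<String, Integer> totalCount = new HashMap<>(); // occurrences over all documents
        for (List<String> doc : docs) {
            Set<String> seenInDoc = new HashSet<>();
            for (String term : doc) {
                totalCount.merge(term, 1, Integer::sum);
                if (seenInDoc.add(term)) {
                    // First time this term shows up in this document.
                    docFreq.merge(term, 1, Integer::sum);
                }
            }
        }
        double maxDocs = maxDocFraction * docs.size();
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : totalCount.entrySet()) {
            boolean frequentEnough = e.getValue() >= minCount;
            boolean notTooCommon = docFreq.get(e.getKey()) <= maxDocs;
            if (frequentEnough && notTooCommon) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("a", "c"),
                Arrays.asList("a", "c", "c"),
                Arrays.asList("d"));
        // Toy limits so the effect is visible on four documents;
        // the message above uses 0.01 and 50.
        System.out.println(keptTerms(docs, 0.5, 2));
    }
}
```

With the figures from the message, the call would be keptTerms(docs, 0.01, 50). On a corpus of millions of documents the two counts would be computed in a distributed job rather than in memory.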
>
>
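Ted's suggestion in the quoted thread above (decide how many entries to keep, order them by descending score, drop the rest) can be sketched as a per-row top-k selection. This is only an in-memory illustration in plain Java: a bounded min-heap stands in for the Hadoop sort Ted mentions, and none of it is Mahout's or RowSimilarityJob's actual code. The names TopKSimilar, Scored and topK are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Mahout.
final class TopKSimilar {

    /** Hypothetical holder for one (document id, similarity score) pair. */
    static final class Scored {
        final int docId;
        final double score;

        Scored(int docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Keeps the k highest-scoring candidates, returned in descending score order. */
    static List<Scored> topK(List<Scored> candidates, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        Comparator<Scored> byScore = Comparator.comparingDouble(s -> s.score);
        // Min-heap: the weakest of the current top k sits at the head.
        PriorityQueue<Scored> heap = new PriorityQueue<>(k, byScore);
        for (Scored candidate : candidates) {
            if (heap.size() < k) {
                heap.add(candidate);
            } else if (candidate.score > heap.peek().score) {
                heap.poll();          // evict the current weakest entry
                heap.add(candidate);
            }
        }
        List<Scored> result = new ArrayList<>(heap);
        result.sort(byScore.reversed()); // highest score first
        return result;
    }

    public static void main(String[] args) {
        List<Scored> row = Arrays.asList(
                new Scored(1, 0.2), new Scored(2, 0.9),
                new Scored(3, 0.5), new Scored(4, 0.7));
        for (Scored s : topK(row, 2)) {
            System.out.println(s.docId + "\t" + s.score);
        }
    }
}
```

Fixing k rather than a score cutoff bounds the size of every row whatever the score distribution looks like, which is the point of preferring a sparseness target over a threshold.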


-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
