mahout-user mailing list archives

From Kris Jack <mrkrisj...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Tue, 29 Jun 2010 09:25:05 GMT
Hi Sebastian,

You really are very kind!  I have taken your code and run it to print out
the contents of the output file.  There are indeed only 37,952 results, so
that gives me more confidence in the vector dumper.  I'm not sure why there
was a memory problem, though, seeing as it seems to have output the results
correctly.  Now I just have to match them up with my original Lucene ids and
see how it is performing.  I'll keep you posted with the results.
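
In case it's useful, here is roughly how I plan to do the matching.  It's
just a sketch under my own assumptions: it presumes I kept a SequenceFile
that maps each row index to its Lucene id, and the "docIndex" path, the
Text value type and the helper itself are mine, not something the job
produces.

// hypothetical helper: load a row-index -> Lucene id mapping so that the
// rows printed by MatrixReader can be tied back to documents
// (uses java.util.HashMap plus Hadoop's SequenceFile, IntWritable, Text)
static Map<Integer, String> loadRowToLuceneId(FileSystem fs,
    Configuration conf) throws IOException {
  Map<Integer, String> mapping = new HashMap<Integer, String>();
  SequenceFile.Reader reader =
      new SequenceFile.Reader(fs, new Path("docIndex"), conf);
  try {
    IntWritable row = new IntWritable();
    Text luceneId = new Text();
    while (reader.next(row, luceneId)) {
      mapping.put(row.get(), luceneId.toString());
    }
  } finally {
    reader.close();
  }
  return mapping;
}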

Thanks,
Kris



2010/6/28 Sebastian Schelter <ssc.open@googlemail.com>

> Hi Kris,
>
> Unfortunately I'm not familiar with the VectorDumper code (and a quick
> look didn't help either), so I can't help you with the OutOfMemoryError.
>
> It is possible that only 37,952 results are found for an input of
> 500,000 vectors; it really depends on the actual data. If you're sure
> that there should be more results, you could provide me with a sample
> input file and I'll try to find out why there aren't more.
>
> I wrote a small class for you that dumps the output file of the job to
> the console (I tested it with the output of my unit tests); maybe it
> can help us find the source of the problem.
>
> -sebastian
>
> public class MatrixReader extends AbstractJob {
>
>  public static void main(String[] args) throws Exception {
>    ToolRunner.run(new MatrixReader(), args);
>  }
>
>  @Override
>  public int run(String[] args) throws Exception {
>
>    addInputOption();
>
>    Map<String,String> parsedArgs = parseArguments(args);
>    if (parsedArgs == null) {
>      return -1;
>    }
>
>    Configuration conf = getConf();
>    FileSystem fs = FileSystem.get(conf);
>
>    // pick the first part-* file from the job's output directory
>    Path vectorFile = fs.listStatus(getInputPath(),
>        TasteHadoopUtils.PARTS_FILTER)[0].getPath();
>
>    SequenceFile.Reader reader = null;
>    try {
>      reader = new SequenceFile.Reader(fs, vectorFile, conf);
>      IntWritable key = new IntWritable();
>      VectorWritable value = new VectorWritable();
>
>      // prints one line per row: "row: index,value;index,value;..."
>      while (reader.next(key, value)) {
>        int row = key.get();
>        System.out.print(row + ": ");
>        Iterator<Element> elementsIterator = value.get().iterateNonZero();
>        String separator = "";
>        while (elementsIterator.hasNext()) {
>          Element element = elementsIterator.next();
>          System.out.print(separator + element.index() + "," + element.get());
>          separator = ";";
>        }
>        System.out.println();
>      }
>    } finally {
>      if (reader != null) {
>        reader.close();
>      }
>    }
>    return 0;
>  }
> }
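>
> To run it (a guess at your setup; AbstractJob only needs the --input
> option here, and the jar name below is just a placeholder), something
> like this should work:
>
> $ hadoop jar your-job.jar MatrixReader --input /home/kris/similarRows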
>
> On 28.06.2010 17:18, Kris Jack wrote:
> > Hi,
> >
> > I am now using the version of
> > org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian
> > has written and has been added to the trunk.  Thanks again for that!  I
> > can generate an output file that should contain a list of documents with
> > their top 100 most similar documents.  I am having problems, however, in
> > converting the output file into a readable format using mahout's
> > vectordump:
> >
> > $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
> > no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
> > Input Path: /home/kris/similarRows
> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
> >     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
> >     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
> >     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
> >     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> >     at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
> >     at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
> >
> > What is this doing that takes up so much memory?  A file is produced
> > with 37,952 readable rows but I'm expecting more like 500,000 results,
> > since I have this number of documents.  Should I be using something
> > else to read the output file of the RowSimilarityJob?
> >
> > Thanks,
> > Kris
> >
> >
> >
> > 2010/6/18 Sebastian Schelter <ssc.open@googlemail.com>
> >
> >
> >> Hi Kris,
> >>
> >> Maybe you want to give the patch from
> >> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
> >> tested it with larger data yet, but I would be happy to get some
> >> feedback for it and maybe it helps you with your use case.
> >>
> >> -sebastian
> >>
> >> On 18.06.2010 18:46, Kris Jack wrote:
> >>
> >>> Thanks Ted,
> >>>
> >>> I got that working.  Unfortunately, the matrix multiplication job is
> >>> taking far longer than I hoped.  With just over 10 million documents,
> >>> 10 mappers and 10 reducers, I can't get it to complete the job in
> >>> under 48 hours.
> >>>
> >>> Perhaps you have an idea for speeding it up?  I have already been
> >>> quite ruthless with making the vectors sparse.  I did not include
> >>> terms that appeared in over 1% of the corpus and only kept terms that
> >>> appeared at least 50 times.  Is it normal that the matrix
> >>> multiplication map-reduce task should take so long to process with
> >>> this quantity of data and resources available, or do you think that
> >>> my system is not configured properly?
> >>>
> >>> Thanks,
> >>> Kris
> >>>
> >>>
> >>>
> >>> 2010/6/15 Ted Dunning <ted.dunning@gmail.com>
> >>>>
> >>>> Thresholds are generally dangerous.  It is usually preferable to
> >>>> specify the sparseness you want (1%, 0.2%, whatever), sort the
> >>>> results in descending score order using Hadoop's builtin
> >>>> capabilities, and just drop the rest.
> >>>>
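> >>>> In code, the idea is roughly this.  It's just a sketch, not anything
> >>>> that ships with Mahout (Vector and Element are Mahout's; the
> >>>> IndexValue holder and the topN method are made up), and a reducer
> >>>> would call it once per row and emit only what it returns:
> >>>>
> >>>> // hypothetical holder for one entry's index and value; we copy them
> >>>> // out because Mahout's sparse iterators may reuse the Element object
> >>>> static class IndexValue {
> >>>>   final int index;
> >>>>   final double value;
> >>>>   IndexValue(int index, double value) {
> >>>>     this.index = index;
> >>>>     this.value = value;
> >>>>   }
> >>>> }
> >>>>
> >>>> // keep only the n largest-valued entries of a row
> >>>> // (java.util.PriorityQueue used as a min-heap of size n)
> >>>> static List<IndexValue> topN(Vector row, int n) {
> >>>>   PriorityQueue<IndexValue> queue = new PriorityQueue<IndexValue>(n,
> >>>>       new Comparator<IndexValue>() {
> >>>>         public int compare(IndexValue a, IndexValue b) {
> >>>>           return Double.compare(a.value, b.value);
> >>>>         }
> >>>>       });
> >>>>   Iterator<Vector.Element> it = row.iterateNonZero();
> >>>>   while (it.hasNext()) {
> >>>>     Vector.Element e = it.next();
> >>>>     queue.offer(new IndexValue(e.index(), e.get()));
> >>>>     if (queue.size() > n) {
> >>>>       queue.poll(); // evict the smallest of the n+1 candidates
> >>>>     }
> >>>>   }
> >>>>   // note: heap order, not sorted; sort by value before emitting
> >>>>   return new ArrayList<IndexValue>(queue);
> >>>> }
> >>>>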
> >>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
> >>>>
> >>>>> I was wondering if there was an interesting way to do this with
> >>>>> the current mahout code, such as requesting that the Vector
> >>>>> accumulator returns only elements that have values greater than a
> >>>>> given threshold, sorting the vector by value rather than key, or
> >>>>> something else?


-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
