mahout-user mailing list archives

From Vckay <>
Subject Re: Question Regarding Distributed Row Matrix
Date Thu, 05 May 2011 18:06:01 GMT
On Thu, May 5, 2011 at 12:19 PM, Jake Mannix <> wrote:

> Vckay,
>  People don't typically take a raw Text file which has no keys, and build
> a DistributedRowMatrix from it.  You typically have something you want
> to key on (file name, guid from a database, embedded timestamp, etc).
> If you don't have any ids for your rows, you'll need to generate some.
>
>  If you look at what we do in RowIdJob, it maps over a SequenceFile
> of Text -> VectorWritable (which is the output of the seqdirectory
> script: filename -> vector), and turns this into a pair of sequence files,
> Int -> Text, and Int -> VectorWritable.  The first is a "dictionary" of
> which int (docId) maps to which filename, and the latter is a true
> DistributedRowMatrix, ready for working with transpose, svd, etc.
>  Note that RowIdJob is not truly scalable: it iterates over your entire
> text directly, so it does not use any parallelism.

Ah OK. Thanks a lot. That sounds like exactly what I was looking for. To
clarify why I was working with a raw text file: I wanted to make sure I got
everything working on a small file whose results I could check against a
sequential run. I eventually plan to test the algorithm on image data, where
I guess I can use the file name of each image to identify a row.
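
To check my understanding of the row-id step described above, here is a
minimal plain-Java sketch of the idea. It is not Mahout's RowIdJob and uses no
Hadoop or Mahout classes: in-memory maps stand in for the sequence files,
double[] stands in for VectorWritable, and the names RowIdSketch, Result,
assignRowIds and the image file names are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: NOT Mahout's RowIdJob. In-memory maps stand in for the
// sequence files, and double[] stands in for VectorWritable.
final class RowIdSketch {

    /** The two outputs: an id -> name dictionary and an id -> row matrix. */
    static final class Result {
        final Map<Integer, String> dictionary = new LinkedHashMap<>();
        final Map<Integer, double[]> matrix = new LinkedHashMap<>();
    }

    /**
     * Walks the named rows once, in iteration order, and gives each row the
     * next integer id, starting at 0. This is a single sequential pass.
     */
    static Result assignRowIds(Map<String, double[]> namedRows) {
        Result result = new Result();
        int nextId = 0;
        for (Map.Entry<String, double[]> entry : namedRows.entrySet()) {
            result.dictionary.put(nextId, entry.getKey());  // id -> file name
            result.matrix.put(nextId, entry.getValue());    // id -> row vector
            nextId++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input: file name -> feature vector.
        Map<String, double[]> namedRows = new LinkedHashMap<>();
        namedRows.put("img_a.png", new double[] {1.0, 0.0, 2.0});
        namedRows.put("img_b.png", new double[] {0.0, 3.0, 0.0});

        Result result = assignRowIds(namedRows);
        for (Map.Entry<Integer, String> e : result.dictionary.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The dictionary is kept separate from the matrix so that the matrix rows carry
only integer keys, while the file names can still be looked up afterwards.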
