mahout-user mailing list archives

From Sean Owen <>
Subject Re: Generating a Document Similarity Matrix
Date Wed, 09 Jun 2010 18:25:12 GMT
On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <> wrote:
> The ItemSimilarityJob actually uses implementations of the Vector
> class hierarchy?  I think that's the issue - if the on-disk and in-mapper
> representations are never Vectors, then they won't interoperate with
> any of the matrix operations...

Yes they are Vectors.

> And yeah, keying on ints is necessary for now, unless we want to
> make a new matrix type (at least for distributed matrices) which
> keys on longs (which actually might be a good idea: now that
> we're using VInt and VLong, the disk space and network usage
> should not be adversely affected - just the in-memory
> representation).

Oh I see. Well that's not a problem. Already, IDs have to be mapped to
ints to be used as dimensions in a Vector. So in most cases things are
keyed by these int pseudo-IDs. That's OK too.
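
To illustrate that mapping, here is a minimal sketch in plain Java. It is not Mahout's actual code: the class name IdIndexSketch and the method idToIndex are hypothetical, and folding a long into a non-negative int by hashing is just one assumed way to do it.

```java
import java.util.HashMap;
import java.util.Map;

public class IdIndexSketch {

    // Hypothetical helper: fold a 64-bit ID into a non-negative int
    // that can serve as a dimension (index) in a vector.
    public static int idToIndex(long id) {
        int hash = (int) (id ^ (id >>> 32)); // same value as Long.hashCode(id)
        return hash & 0x7FFFFFFF;            // clear the sign bit
    }

    public static void main(String[] args) {
        long[] itemIds = {42L, 5_000_000_000L, -7L};

        // Keep the reverse mapping so int pseudo-IDs can be turned
        // back into the original long IDs after the computation.
        Map<Integer, Long> indexToId = new HashMap<>();
        for (long id : itemIds) {
            int index = idToIndex(id);
            indexToId.put(index, id);
            System.out.println(id + " -> " + index);
        }
    }
}
```

Two distinct long IDs can collide on the same int under a hash like this, so the reverse mapping is only reliable when collisions are rare or handled separately.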

A matrix is a bunch of vectors -- at least, that's a nice structure
for a SequenceFile. Row (or col) ID mapped to row (column) vector.
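
As a rough in-memory picture of that layout, here is a sketch in plain Java. The class RowMatrixSketch is hypothetical; the outer map stands in for the SequenceFile and the inner map for a sparse row Vector, and no Hadoop or Mahout classes are used.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

public class RowMatrixSketch {

    // row ID -> sparse row vector (column index -> value)
    private final Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();

    // Store a value, creating the row vector on first use.
    public void set(int row, int col, double value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, value);
    }

    // Missing rows and missing entries read as zero, as in a sparse matrix.
    public double get(int row, int col) {
        return row(row).getOrDefault(col, 0.0);
    }

    // Return one row vector; an absent row is an empty vector.
    public Map<Integer, Double> row(int row) {
        return rows.getOrDefault(row, Collections.emptyMap());
    }

    public static void main(String[] args) {
        RowMatrixSketch m = new RowMatrixSketch();
        m.set(0, 3, 1.5);
        m.set(2, 0, 4.0);
        System.out.println(m.row(0));
        System.out.println(m.get(1, 1));
    }
}
```

Keying by column instead of row would look the same with the two indices swapped.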

Is that not what other jobs are using?
If not, what's the better alternative we could think about converging on?
