mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Wed, 09 Jun 2010 18:25:12 GMT
On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> The ItemSimilarityJob actually uses implementations of the Vector
> class hierarchy?  I think that's the issue - if the on-disk and in-mapper
> representations are never Vectors, then they won't interoperate with
> any of the matrix operations...

Yes they are Vectors.

> And yeah, keying on ints is necessary for now, unless we want to
> make a new matrix type (at least for distributed matrices) which
> keys on longs (which actually might be a good idea: now that
> we're using VInt and VLong, the disk space and network usage
> should not be adversely affected - just the in-memory
> representation).

Oh I see. Well that's not a problem. Already, IDs have to be mapped to
ints to be used as dimensions in a Vector. So in most cases things are
keyed by these int pseudo-IDs. That's OK too.
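
To illustrate the kind of mapping I mean, here is a minimal sketch in plain Java. It is not the actual job code: the class name and the hash-and-mask scheme are assumptions for illustration only. It folds a long ID into a non-negative int that can serve as a Vector dimension, and keeps a reverse map so the original ID can be recovered later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and hashing scheme are illustrative assumptions.
public class IdIndexSketch {

    // Fold a 64-bit ID into a non-negative int usable as a vector dimension.
    // Long.hashCode XORs the high and low 32 bits; the mask clears the sign bit.
    static int idToIndex(long id) {
        return 0x7FFFFFFF & Long.hashCode(id);
    }

    public static void main(String[] args) {
        long[] itemIds = {42L, 5_000_000_000L, -7L};

        // Reverse map from int pseudo-ID back to the original long ID.
        Map<Integer, Long> indexToId = new HashMap<>();
        for (long id : itemIds) {
            indexToId.put(idToIndex(id), id);
        }
        System.out.println(indexToId);
    }
}
```

Two distinct longs can land on the same int under a scheme like this, so it trades exactness for a stateless mapping; a dictionary-based assignment would avoid collisions at the cost of an extra pass over the data.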

A matrix is a bunch of vectors -- at least, that's a nice structure
for a SequenceFile. Row (or col) ID mapped to row (column) vector.

Is that not what other jobs are using?
What's the better alternative we could think about converging on?
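
For concreteness, the row-ID-to-row-vector layout described above can be sketched in memory as follows. This is plain Java with hypothetical names, standing in for the on-disk structure; it uses nested maps rather than any Mahout or Hadoop classes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a sparse matrix stored as row index -> sparse row vector.
public class RowMatrixSketch {

    // Outer key is the int row ID; inner map is column index -> value.
    private final Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();

    void set(int row, int col, double value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, value);
    }

    // Missing entries are treated as zero.
    double get(int row, int col) {
        Map<Integer, Double> r = rows.get(row);
        return r == null ? 0.0 : r.getOrDefault(col, 0.0);
    }

    // Dot product of two rows, touching only the non-zero entries of row a.
    double rowDot(int a, int b) {
        Map<Integer, Double> ra = rows.get(a);
        if (ra == null) {
            return 0.0;
        }
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : ra.entrySet()) {
            sum += e.getValue() * get(b, e.getKey());
        }
        return sum;
    }
}
```

Each (row ID, row vector) pair here corresponds to one key/value record in the file, which is why operations that work row by row fit this layout well.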
