mahout-user mailing list archives

From Vckay <darkvc...@gmail.com>
Subject Re: Question Regarding Distributed Row Matrix
Date Thu, 05 May 2011 13:57:53 GMT
OK. I do plan to use SVD and transpose. Assuming you are correct, I am
curious: how are people solving this problem? (Surely not all data has
row tags in it.) One solution I had in mind was to use a single reducer
(with the mapper emitting a single key) so that the reducer can assign
the row numbers. However, this is not a clean solution, since the
numbering would then have to be done serially.
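
A minimal sketch of one alternative to the single reducer, assuming the input can be treated as an ordered list of splits: count the lines in each split, take a running sum of those counts to get each split's first row number, then number every split independently. The class and method names below are hypothetical and the code is plain Java with in-memory lists standing in for input splits; it is not Mahout or Hadoop API, and collecting the per-split counts between the two passes in a real job is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: in-memory lists stand in for input splits.
public class RowNumbering {

    // Pass 1: count the lines in each split (each count is independent).
    static int[] countPerSplit(List<List<String>> splits) {
        int[] counts = new int[splits.size()];
        for (int i = 0; i < splits.size(); i++) {
            counts[i] = splits.get(i).size();
        }
        return counts;
    }

    // Running sum of the counts: the first row number of each split.
    static int[] startOffsets(int[] counts) {
        int[] starts = new int[counts.length];
        int running = 0;
        for (int i = 0; i < counts.length; i++) {
            starts[i] = running;
            running += counts[i];
        }
        return starts;
    }

    // Pass 2: number the lines of one split, starting at its offset.
    static List<String> numberSplit(List<String> split, int start) {
        List<String> out = new ArrayList<>();
        int row = start;
        for (String line : split) {
            out.add(row + "\t" + line);
            row++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        int[] starts = startOffsets(countPerSplit(splits));
        for (int i = 0; i < splits.size(); i++) {
            for (String row : numberSplit(splits.get(i), starts[i])) {
                System.out.println(row);
            }
        }
    }
}
```

Only the running sum is serial, and it touches one number per split rather than one per row. The scheme depends on the splits having a fixed order (for example by file and then byte offset), which is an assumption about the input rather than something guaranteed here.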

On Thu, May 5, 2011 at 12:49 AM, Dmitriy Lyubimov <dlyubimov@apache.org> wrote:

> The interpretation of keys in sequence files is subject to the
> restrictions of a particular algorithm. We held a discussion on this
> recently, and I think the consensus was that we don't want to lock DRM
> as a format to a particular interpretation of keys in the file -- it is
> left to the client's code to interpret them, with vectorization as the
> ultimate goal.
>
> However, different algorithms may interpret it differently. E.g.
> stochastic SVD is agnostic of both the key and its class and just
> copies it into the keys of the left eigenvector matrix, whereas Lanczos
> SVD (I think) requires them to be IntWritable (and may also require them
> to be unique -- I am not 100% sure). Similarly, matrix transpose (I
> think) would also require them to be IntWritable and, on top of that,
> interpret them as row numbers for the sake of transposition. (I might
> be wrong about that last one.)
>
> I am not sure about KMeans code.
>
> On Wed, May 4, 2011 at 8:54 PM, Vckay <darkvckay@gmail.com> wrote:
> > Hello all,
> >  I am trying to create a distributed row matrix from my data, which is
> > currently available as text input, with each line supposed to become a
> > row of the matrix. I am using the Spectral KMeans code as a way of
> > understanding how DistributedRowMatrix works and I am sort of confused.
> > Specifically: Does DistributedRowMatrix require that the SequenceFiles
> > have the row ID as the "Key"?
> > (The Spectral KMeans code implements that, which is easy because their
> > input's first word has that information. However, since as far as I can
> > see TextInputFormat just renders a unique byte offset (not necessarily
> > the line number), I can't recover the line number from my data.
> > Furthermore, suppose I do change my data to, say, a bunch of images
> > living in a flat directory, I am thinking of having the "key" be some
> > combination of the file number and this byte offset.)
> >
> > Thanks
> >
>
