mahout-user mailing list archives

From Dmitriy Lyubimov <dlyubi...@apache.org>
Subject Re: Question Regarding Distributed Row Matrix
Date Thu, 05 May 2011 05:49:29 GMT
The interpretation of keys in sequence files is subject to the
restrictions of a particular algorithm. We held a discussion on this
recently, and I think the consensus was that we don't want to lock DRM
as a format to a particular interpretation of the keys in the file. It
is left to the client's code to interpret them as needed for the
ultimate goal of vectorization.
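
As a rough illustration of client code choosing its own keys, here is a minimal single-process sketch that pairs each text line with a sequential integer row ID and parses it into a dense row. The class RowKeySketch and the method toKeyedRows are hypothetical names, and plain Java collections stand in for the sequence file; in an actual job the pairs would presumably be written out as IntWritable keys with vector values. Sequential numbering is easy here only because one process sees every line in order, which is not the case inside a distributed map task.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class RowKeySketch {

    // Hypothetical helper: assigns a sequential row ID (0, 1, 2, ...) to every
    // non-blank line and parses its whitespace-separated numbers into a dense row.
    // The (Integer, double[]) pairs are stand-ins for sequence file records.
    static List<Map.Entry<Integer, double[]>> toKeyedRows(List<String> lines) {
        List<Map.Entry<Integer, double[]>> rows = new ArrayList<>();
        int rowId = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines do not consume a row ID
            }
            String[] tokens = trimmed.split("\\s+");
            double[] values = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                values[i] = Double.parseDouble(tokens[i]);
            }
            rows.add(new AbstractMap.SimpleEntry<>(rowId++, values));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1.0 2.0 3.0", "4.0 5.0 6.0");
        for (Map.Entry<Integer, double[]> row : toKeyedRows(lines)) {
            System.out.println(row.getKey() + " -> " + Arrays.toString(row.getValue()));
        }
    }
}
```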

However, different algorithms may interpret the keys differently. E.g.
stochastic SVD is agnostic of both the key and its class and just
copies it into the keys of the left eigenvector matrix, whereas Lanczos
SVD (I think) requires them to be IntWritable (and may also require
them to be unique -- I am not 100% sure). Similarly, matrix transpose
(I think) would also require them to be IntWritable and, on top of
that, interpret them as row numbers for the sake of transposition. (I
might be wrong about that last one.)
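
To show why a transpose would have to read the keys as row numbers, here is a small in-memory sketch. It is not Mahout's implementation: TransposeSketch and transpose are hypothetical names, and a Map<Integer, double[]> stands in for the keyed rows. Each input key becomes a position inside the output rows, so a key that is not a valid row number has nowhere to go.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class TransposeSketch {

    // Input: key = row number, value = dense row of length numCols.
    // Output: key = column number, value = dense column of length numRows.
    static Map<Integer, double[]> transpose(Map<Integer, double[]> rows,
                                            int numRows, int numCols) {
        Map<Integer, double[]> result = new TreeMap<>();
        for (int c = 0; c < numCols; c++) {
            result.put(c, new double[numRows]);
        }
        for (Map.Entry<Integer, double[]> row : rows.entrySet()) {
            int r = row.getKey(); // the key is used directly as an index
            if (r < 0 || r >= numRows) {
                throw new IllegalArgumentException("key is not a row number: " + r);
            }
            double[] values = row.getValue();
            for (int c = 0; c < numCols; c++) {
                result.get(c)[r] = values[c];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, double[]> rows = new TreeMap<>();
        rows.put(0, new double[] {1.0, 2.0, 3.0});
        rows.put(1, new double[] {4.0, 5.0, 6.0});
        for (Map.Entry<Integer, double[]> col : transpose(rows, 2, 3).entrySet()) {
            System.out.println(col.getKey() + " -> " + Arrays.toString(col.getValue()));
        }
    }
}
```

An algorithm that only copies keys through, as described for stochastic SVD, never needs the bounds check above, which is why it can stay agnostic of the key type.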

I am not sure about KMeans code.

On Wed, May 4, 2011 at 8:54 PM, Vckay <darkvckay@gmail.com> wrote:
> Hello all,
>  I am trying to create a distributed row matrix of my data which is
> currently available as text input with each line supposed to become a line
> of the distributed row. I am using the Spectral KMeans code as a way of
> understanding how DistributedRowMatrix works and I am sort of confused.
> Specifically: Does DistributedRowMatrix require that the SequenceFiles have
> the row ID as the "Key" ?
> ( The Spectral Kmeans code implements that which is easy because their
> input's first word has that information. However, since as far as I can see
> TextInputFormat just renders a unique byte offset (not necessarily the line
> number), I can't recover the line number from my data. Furthermore, suppose I
> do change my data to say a bunch of images living in a flat directory, I am
> thinking of having "key" being some combination of the file number and this
> byte offset. )
>
> Thanks
>
