mahout-user mailing list archives

From Dmitriy Lyubimov <>
Subject Re: Question Regarding Distributed Row Matrix
Date Thu, 05 May 2011 17:32:39 GMT
I guess I can work through this a little from the transpose point of view.

Let's say your original matrix is 10,000 x 10,000. Suppose you want to
number the rows in parallel.

You can still have, say, 10 reducers. Assuming the original mapper
output is distributed uniformly across them, each reducer will be
processing approximately 1,000 rows.

Now you can have the i-th reducer output ids sequentially within its
own range, say from i*100,000 up to (but not including) (i+1)*100,000.

So the output will be a distributed row matrix with a potential height
of up to 1,000,000 rows, but the number of actual rows will still be
10,000. It is technically tall, but it has huge zero bands of missing rows.
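
To make the numbering concrete, here is a minimal Python sketch of the scheme described above. It is not Mahout code: the function name assign_row_ids and the range_size parameter are made up for illustration, and it assumes a perfectly uniform split of 10,000 rows over 10 reducers.

```python
def assign_row_ids(reducer_index, num_rows_in_reducer, range_size=100_000):
    """Hypothetical helper: ids emitted by one reducer, sequential within its own range."""
    if num_rows_in_reducer > range_size:
        raise ValueError("reducer received more rows than its id range can hold")
    start = reducer_index * range_size
    return list(range(start, start + num_rows_in_reducer))


num_reducers = 10
rows_per_reducer = 1_000  # assumption: uniform split of 10,000 rows

all_ids = []
for i in range(num_reducers):
    all_ids.extend(assign_row_ids(i, rows_per_reducer))

nominal_height = num_reducers * 100_000  # highest possible row id + 1
actual_rows = len(all_ids)               # rows that really exist
```

Each reducer only needs to know its own index, so no coordination between reducers is required. The ids are unique but leave gaps between ranges.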

When you transpose, the result is sparse sequential vectors. Each has a
nominal length of 1M but will still hold just 10,000 non-zero elements.
It will have to encode the 'gaps' of zero elements, so there will be
some space overhead for that, but it is small compared to what 10k
non-zero elements would occupy.
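
A toy illustration of that point, again in plain Python rather than Mahout: the matrix is stored as a dict of sparse rows, and the transpose helper below is a hypothetical stand-in for the real transpose job. The row ids come from two imaginary reducers with a range of 100 ids each.

```python
from collections import defaultdict


def transpose(rows):
    """Transpose a sparse matrix stored as {row_id: {col_id: value}}."""
    columns = defaultdict(dict)
    for row_id, vector in rows.items():
        for col_id, value in vector.items():
            columns[col_id][row_id] = value
    return dict(columns)


# 4 rows with gaps in the numbering (ids 2..99 are simply absent)
rows = {
    0:   {0: 1.0, 1: 2.0},
    1:   {1: 3.0},
    100: {0: 4.0},
    101: {0: 5.0, 1: 6.0},
}

transposed = transpose(rows)
stored = sum(len(vector) for vector in transposed.values())
```

The transposed vectors carry indices as large as 101, but only the six original non-zero values are stored. The missing ids cost nothing beyond the larger index values.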

So what I am trying to say is that neither the pre-transposed input nor
the transposed output will increase in size or flops just because your
numbering is not strictly sequential.


On Thu, May 5, 2011 at 10:21 AM, Dmitriy Lyubimov <> wrote:
>> though to be frank, I don't understand your second paragraph i.e, how
>> turning the vectors into sparse vectors will enable me to do transpose in a
>> easier fashion without resorting to doing it manually), however, I suppose
>> the purpose of the DRM format was to make step 5,6 much easier so I guess I
> What I meant is that, since you can use sparse vectors, you don't have to
> number the rows strictly sequentially with one reducer. You can still
> have several reducers that each number rows sequentially within their
> own range, though not globally, and that still will not be
> detrimental from the problem-size point of view.
> -d
>> Thanks again!
>> On Thu, May 5, 2011 at 9:40 AM, Dmitriy Lyubimov <> wrote:
>>> I think the first step is to decide on the pipeline of algorithms. Once you
>>> know the algorithms you want to run through, it will be easier to come up
>>> with vectorization requirements.
>>> That said, for the sake of transposition, note that Mahout supports sparse
>>> vectors, i.e. it doesn't matter what the element index is, as long as
>>> it is unique; only the number of nonzero elements matters. So I don't think
>>> you are per se constrained in the number of reducers during vectorization
>>> for transpose.
>>> That would have been pretty scale restricting, indeed.
>>> apologies for brevity.
>>> Sent from my android.
>>> -Dmitriy
>>> On May 5, 2011 6:58 AM, "Vckay" <> wrote:
