mahout-user mailing list archives

From Dmitriy Lyubimov <dlyubi...@apache.org>
Subject Re: about DistributedRowMatrix implementation
Date Sat, 01 Oct 2011 23:02:44 GMT
Although I sense the discussion is really about a bit more than just reading
inputs one element at a time.

Yes, I guess multiplication is generally 2 passes unless it uses a map-side
join, which I think has more interesting prerequisites for the input than a
general DRM assumes. I thought map-side joins require the same sort order and
partitioning, and a DRM doesn't assume that in the most general case? Although
I have a pretty vague idea of how exactly that particular input format does
what it does. It is not supported in the new API, and I felt I wanted to
abstain from going back to the deprecated stuff just to have that.
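For illustration, the same-sort/same-partitioning prerequisite boils down to the fact that a map-side join is essentially a sorted two-pointer merge. This is a hypothetical standalone sketch of that merge, not Mahout's CompositeInputFormat:

```java
import java.util.*;

// Illustrative sketch (not Mahout code): a map-side join is in essence a
// two-pointer merge over two inputs. It only produces correct results when
// both sides are sorted by the same key and partitioned identically -- the
// prerequisite a general DRM does not guarantee.
public class MergeJoinSketch {

  // Each element is {key, value}; both iterators MUST be sorted by key.
  static List<String> mergeJoin(Iterator<int[]> left, Iterator<int[]> right) {
    List<String> out = new ArrayList<>();
    int[] l = left.hasNext() ? left.next() : null;
    int[] r = right.hasNext() ? right.next() : null;
    while (l != null && r != null) {
      if (l[0] < r[0]) {
        l = left.hasNext() ? left.next() : null;   // advance the left side
      } else if (l[0] > r[0]) {
        r = right.hasNext() ? right.next() : null; // advance the right side
      } else {                                     // matching keys: join them
        out.add(l[0] + ":" + l[1] + "|" + r[1]);
        l = left.hasNext() ? left.next() : null;
        r = right.hasNext() ? right.next() : null;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<int[]> a = Arrays.asList(new int[]{1, 10}, new int[]{2, 20}, new int[]{4, 40});
    List<int[]> b = Arrays.asList(new int[]{2, 5}, new int[]{3, 7}, new int[]{4, 9});
    // Only keys present on both (sorted) sides are joined.
    System.out.println(mergeJoin(a.iterator(), b.iterator())); // [2:20|5, 4:40|9]
  }
}
```

If either input arrives in a different order or partitioning, the merge silently skips matches, which is why the join degenerates to an extra shuffle pass otherwise.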

Alright, please never mind.
On Oct 1, 2011 3:45 PM, "Dmitriy Lyubimov" <dlyubimov@apache.org> wrote:
> I have a branch in github that equips VectorWritable with a preprocessor
> via a Configurable hadoop interface and happily preprocesses input element
> by element without creating any heap object in memory.
>
> I proposed to contribute that approach a year ago, but it was rejected,
> afaik, on the grounds that a push-style preprocessor is a "bad" or
> "confusing" pattern to have.
>
> If you want, I can dig that patch out for judgement again.
>
> The benefits of this patch are significant. For one, it unbounds the width
> of the input with respect to memory, reduces garbage collector pressure, and
> avoids needing a lot of memory (actually, any extra heap memory) for wide
> matrices... it makes sense all around, anywhere you look at it. Except for
> the "bad" pattern.
>
> One thing, though, without a doubt: it is totally possible (and actually
> the version of ssvd we were using ran exactly on that
> projection-as-a-single-element-preprocessor pattern).
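A minimal sketch of what such a push-style element preprocessor could look like. The interface name and the wiring here are assumptions for illustration, not the actual patch:

```java
import java.io.*;

// Hypothetical sketch of the push-style preprocessor idea (not the actual
// patch): instead of deserializing a whole row vector onto the heap, the
// reader streams (index, value) pairs straight into a callback, so a row of
// any width costs O(1) extra memory.
public class StreamingVectorReader {

  // Assumed callback interface; in the patch this would presumably be wired
  // into VectorWritable through a Configurable hadoop hook.
  interface ElementPreprocessor {
    void accept(int index, double value);
  }

  // Reads a serialized sparse vector: an int count, then (int, double) pairs.
  // No Vector object is ever materialized.
  static void read(DataInput in, ElementPreprocessor pre) throws IOException {
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      pre.accept(in.readInt(), in.readDouble());
    }
  }

  public static void main(String[] args) throws IOException {
    // Serialize a tiny sparse vector {3: 1.5, 7: -2.0}.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(2);
    out.writeInt(3); out.writeDouble(1.5);
    out.writeInt(7); out.writeDouble(-2.0);

    // Preprocess element by element, e.g. accumulate a sum of squares.
    double[] sumSq = {0};
    read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())),
         (i, v) -> sumSq[0] += v * v);
    System.out.println(sumSq[0]); // 1.5*1.5 + 2.0*2.0 = 6.25
  }
}
```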
> On Oct 1, 2011 3:43 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
>> I have a branch in github that equips VectorWritable with a preprocessor
>> via a Configurable hadoop interface and happily preprocesses input element
>> by element without creating any heap object in memory.
>>
>> I proposed to contribute that approach a year ago, but it was rejected,
>> afaik, on the grounds that a push-style preprocessor is a "bad" or
>> "confusing" pattern to have.
>>
>> If you want, I can dig that patch out for judgement again.
>>
>> The benefits of this patch are significant. For one, it unbounds the width
>> of the input with respect to memory, reduces garbage collector pressure, and
>> avoids needing a lot of memory (actually, any extra heap memory) for wide
>> matrices... it makes sense all around, anywhere you look at it. Except for
>> the "bad" pattern.
>>
>> One thing, though, without a doubt: it is totally possible (and actually
>> the version of ssvd we were using ran exactly on that
>> projection-as-a-single-element-preprocessor pattern).
>>
>> Sent from android tab
>> On Oct 1, 2011 10:42 AM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>>> Marc,
>>>
>>> If you want to do element-at-a-time multiplication, without putting both
>>> row and
>>> column in memory at a time, this is totally doable, but just not
>>> implemented
>>> in Mahout yet. The current implementation manages to do it in one
>>> map-reduce
>>> pass by doing a mapside join (the CompositeInputFormat thing), but in
>>> general
>>> if you don't do a map-side join, it's 2 passes. In which case, doing this
>>> element at a time instead of row/column at a time is also 2 passes, and
>>> has no restrictions on how much is in memory at a time.
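A rough sketch of that two-pass, element-at-a-time idea, simulating the two shuffles with in-memory maps. This is illustrative only, not Mahout code or the code fragments mentioned below:

```java
import java.util.*;

// Illustrative sketch (not Mahout code) of element-at-a-time matrix multiply
// as two "MapReduce passes": C[i][j] = sum_k A[i][k] * B[k][j].
// Pass 1 joins A and B entries on the shared index k and emits partial
// products; pass 2 sums them per (i, j). No row or column of either matrix
// ever needs to sit in memory as a unit in the distributed setting.
public class ElementWiseMultiply {

  // Sparse entries keyed by (row, col).
  static Map<List<Integer>, Double> multiply(Map<List<Integer>, Double> a,
                                             Map<List<Integer>, Double> b) {
    // Pass 1: "shuffle" both inputs by k (A's column index, B's row index).
    Map<Integer, List<double[]>> aByK = new HashMap<>(); // k -> (i, value)
    Map<Integer, List<double[]>> bByK = new HashMap<>(); // k -> (j, value)
    a.forEach((ij, v) -> aByK.computeIfAbsent(ij.get(1), x -> new ArrayList<>())
        .add(new double[]{ij.get(0), v}));
    b.forEach((kj, v) -> bByK.computeIfAbsent(kj.get(0), x -> new ArrayList<>())
        .add(new double[]{kj.get(1), v}));

    // Pass 2: emit partial products and sum them per output cell (i, j).
    Map<List<Integer>, Double> c = new HashMap<>();
    for (int k : aByK.keySet()) {
      for (double[] ai : aByK.get(k)) {
        for (double[] bj : bByK.getOrDefault(k, List.of())) {
          c.merge(List.of((int) ai[0], (int) bj[0]), ai[1] * bj[1], Double::sum);
        }
      }
    }
    return c;
  }

  public static void main(String[] args) {
    // A = [[1, 2], [0, 3]], B = [[4, 0], [1, 5]], zeros omitted.
    Map<List<Integer>, Double> a = Map.of(
        List.of(0, 0), 1.0, List.of(0, 1), 2.0, List.of(1, 1), 3.0);
    Map<List<Integer>, Double> b = Map.of(
        List.of(0, 0), 4.0, List.of(1, 0), 1.0, List.of(1, 1), 5.0);
    System.out.println(multiply(a, b).get(List.of(0, 0))); // 1*4 + 2*1 = 6.0
  }
}
```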
>>>
>>> I've had some code lying around which started on doing this, but never
>>> had a need just yet. If you open up a JIRA ticket for this, I could post
>>> my code fragments so far, and maybe you (or someone else) could help
>>> finish it off.
>>>
>>> Can you describe a bit about how big your matrices are? Dense matrix
>>> multiplication is an O(N^3) operation, so if N is too large so that even
>>> one row or column cannot fit in memory, then N^3 is not going to finish
>>> any time this year or next, from what I can tell.
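A back-of-the-envelope check of that estimate, under the assumption of roughly 10^9 multiply-adds per second per core (the throughput figure is an assumption, not a measurement):

```java
// Back-of-the-envelope for the O(N^3) point above, assuming ~1e9
// multiply-adds per second on a single core.
public class DenseCost {
  public static void main(String[] args) {
    double opsPerSecond = 1e9;            // assumed single-core throughput
    double secondsPerYear = 365.25 * 24 * 3600;
    for (long n : new long[]{100_000L, 1_000_000L}) {
      double ops = (double) n * n * n;    // multiply-adds for dense N x N
      double years = ops / opsPerSecond / secondsPerYear;
      System.out.printf("N=%,d -> %.1e ops, ~%.1f core-years%n", n, ops, years);
    }
  }
}
```

Even at N = 10^6 (a row of doubles that still fits in a few megabytes), 10^18 multiply-adds is on the order of 30 core-years, which is the point being made.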
>>>
>>> -jake
>>>
>>> On Sat, Oct 1, 2011 at 3:18 AM, Marc Sturlese <marc.sturlese@gmail.com>
>>> wrote:
>>>
>>>> Well, after digging into the code and doing some tests, I've seen that
>>>> what I was asking for is not possible. Mahout will only let you do a
>>>> distributed matrix multiplication of 2 sparse matrices, as the
>>>> representation of a whole row or column has to fit in memory. Actually,
>>>> it has to fit a row and a column in memory each time (as it uses the
>>>> CompositeInputFormat).
>>>> To do dense matrix multiplication with hadoop I just found this:
>>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html
>>>> But the data generated by the maps will be extremely huge and the job
>>>> will take ages (of course depending on the number of nodes).
>>>> I've seen around that Hama and R are possible solutions too. Any advice,
>>>> comment or experience?
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/about-DistributedRowMatrix-implementation-tp3375372p3384669.html
>>>> Sent from the Mahout User List mailing list archive at Nabble.com.
>>>>
