mahout-user mailing list archives

From Dan Brickley <dan...@danbri.org>
Subject Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?
Date Thu, 20 Oct 2011 11:06:30 GMT
Hi Sean,

On 18 October 2011 11:09, Sean Owen <srowen@gmail.com> wrote:
> Nice question. I have answers I like.
>
> Really, it would be better to find words that mean
> thing-being-recommended-to and thing-being-recommended. I couldn't
> find easy, general terms that were more intuitive than "user" and
> "item". Even though these things need not be actual people or
> products, and so are inaccurate terms, they connote the right sorts of
> ways of thinking about what they are and how they work.
>
> You could also say that since both can be anything, there should be at
> best one term for both -- a thing or entity. I don't like this on the
> same grounds that it makes things harder to think about in practice.
> Is that "thingID" the thing being recommended or recommended to in the
> code...?
>
> More important I don't think users and items are entirely symmetric,
> even though you could plug items in for users and vice versa. For
> instance, one is 'causing' the ratings and the other isn't. It's
> harder to make future predictions about the black-box source of new
> surprising data. That is, I may learn something quite new about you in
> your 1000th rating, when you rate your first classical music album
> ever; the 1000th rating for that same album probably didn't add much
> new info. Users, the causers, are more variable.
>
> And I think you do tend to have an independent/dependent variable, so
> to speak, in any setup. And, the algorithms sort of embed that
> asymmetry. Item-based recommenders aren't quite the same. For example
> it rather encourages you to pre-compute item-item similarity since
> this is likely to be relatively fixed, being the dependent variable.

In general, I completely agree with your perspective here. Even when
everything bottoms out as matrix maths underneath, that doesn't mean
that developers should only ever see that abstraction in their
day-to-day hacking. Mahout lets you adopt it at various levels: Taste
gives almost a drop-in running service; the bin/mahout utility and
recommender APIs give a variety of high-level entry points; and then,
of course, being open source, Java developers can jump into the code at
any level that suits their needs. For lots of those entry points,
'user' and 'item' are a great way to present things.

Anyhow, I think my question still holds: is the 'bin/mahout
rowsimilarity' piece of Mahout something that should be understood
primarily as a recommendations-oriented component? For my application
I was seeking just 'the most similar books' for any given book, to
feed those affinities to Gephi for visual mapping. I could
conceptualise this in terms of recommending I guess; but I didn't. So
that's why I was mildly surprised when I noticed that others in Jira
and email did seem to think of RowSimilarityJob in
recommendation-oriented terms (i.e. users and items). I completely
agree that those are useful notions to have in the APIs and utilities,
I just somehow wasn't expecting it right there (just as I wouldn't
expect it on the more mathsy APIs either).
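
To make the "just rows and columns" reading concrete, here is a minimal
sketch in plain Python of the kind of computation I had in mind. It is not
Mahout's implementation, and it is not the RowSimilarityJob API. The book
ids, topic codes, the choice of cosine similarity and the function names are
all assumptions made up for illustration. Rows are books, columns are topic
codes, and nothing needs to be called a user or an item.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse rows given as {column: weight} dicts."""
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar_rows(rows, top_n=2):
    """For each row id, return up to top_n (other_id, score) pairs, best first.

    Rows with zero similarity are dropped. This is a brute-force all-pairs
    comparison, fine for a toy example but not for a large matrix.
    """
    result = {}
    for rid, vec in rows.items():
        scored = [(other, cosine(vec, rows[other])) for other in rows if other != rid]
        scored = [(other, score) for other, score in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[rid] = scored[:top_n]
    return result

def edge_list(neighbours):
    """Flatten the neighbour map into (source, target, weight) tuples."""
    return [(rid, other, score)
            for rid, pairs in neighbours.items()
            for other, score in pairs]

# Hypothetical data: rows are books, columns are made-up topic codes.
books = {
    "book_a": {"QA76": 1.0, "Z699": 1.0},
    "book_b": {"QA76": 1.0},
    "book_c": {"Z699": 1.0, "PN45": 1.0},
    "book_d": {"ML100": 1.0},
}

neighbours = most_similar_rows(books)
edges = edge_list(neighbours)
```

The resulting (source, target, weight) tuples are the sort of affinity list
that could then be written out for a graph tool to import.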

cheers,

Dan

ps. as an aside, your points here also remind me of a few passages in
http://en.wikipedia.org/wiki/Six_Degrees:_The_Science_of_a_Connected_Age
that emphasise how a purely mathematical perspective on
networks/graphs can obscure the ways in which different kinds of
network can usefully be understood, and that sometimes you do need to
think about the social context alongside the maths...

> On Tue, Oct 18, 2011 at 9:24 AM, Dan Brickley <danbri@danbri.org> wrote:
>> As an aside, I've noticed this 'users' terminology lurking in the
>> background of RowSimilarityJob (eg. in JIRA discussion).
>>
>> My use of it last week seemed perfectly reasonable; but rows were
>> books (or bibliographic records), with feature columns from library
>> topic codes. Does the 'user' terminology suggest it's really focussed
>> on recommendations?
>>
>> I'm used to seeing this in the Taste part of Mahout, where sometimes
>> it's suggested we can re-use recommender pieces by eg. thinking more
>> broadly and 'recommending topics to books' or vice versa. This makes
>> sense but introduces an extra layer of conceptual confusion. Is there
>> any important sense in which rows (or columns?) in RowSimilarityJob
>> ought to be thought of as users? Or the values/weights as preferences?
>>
>> cheers,
>>
>> Dan
