mahout-user mailing list archives

From Sean Owen <>
Subject Re: Cooccurrence to align different categorization systems (many to many occurrence)
Date Sat, 17 Jul 2010 11:59:04 GMT
You could construe this as a recommendation problem by thinking of
categories as "users". Then you are simply finding the pairs of users
that are most similar based on their mappings to items. Just use
LogLikelihoodSimilarity and compute all pairs of similarities.
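
A minimal standalone sketch of the idea above, in plain Java without Mahout. It scores a pair of categories by the log-likelihood ratio of their item co-occurrence counts, then does so for every pair across the two systems. The class and method names are hypothetical, and the mapping of the ratio into a score via 1 - 1/(1 + llr) is an assumption about how one might normalize it, not a statement of what the library does internally.

```java
import java.util.*;

final class CategoryLlrSketch {

    // x * ln(x), with the convention 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a list of counts: N ln N - sum(k ln k).
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    // Log-likelihood ratio for a 2x2 contingency table:
    // k11 = items in both categories, k12 = only in A,
    // k21 = only in B, k22 = in neither.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Score in [0, 1) for two categories, each given as the set of items
    // it is attached to. Assumes both sets are drawn from a universe of
    // totalItems items.
    static double similarity(Set<String> itemsA, Set<String> itemsB, int totalItems) {
        Set<String> both = new HashSet<>(itemsA);
        both.retainAll(itemsB);
        long k11 = both.size();
        long k12 = itemsA.size() - k11;
        long k21 = itemsB.size() - k11;
        long k22 = totalItems - k11 - k12 - k21;
        double llr = logLikelihoodRatio(k11, k12, k21, k22);
        return 1.0 - 1.0 / (1.0 + llr); // assumed normalization
    }

    // Scores every (system 1 category, system 2 category) pair.
    static Map<String, Double> allPairs(Map<String, Set<String>> system1,
                                        Map<String, Set<String>> system2,
                                        int totalItems) {
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, Set<String>> a : system1.entrySet()) {
            for (Map.Entry<String, Set<String>> b : system2.entrySet()) {
                scores.put(a.getKey() + " -> " + b.getKey(),
                           similarity(a.getValue(), b.getValue(), totalItems));
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up toy data: category name -> items carrying that category.
        Map<String, Set<String>> system1 = new HashMap<>();
        system1.put("Children", new HashSet<>(Arrays.asList("i1", "i2", "i3")));
        system1.put("Fiction", new HashSet<>(Arrays.asList("i1", "i4", "i5")));
        Map<String, Set<String>> system2 = new HashMap<>();
        system2.put("Youth", new HashSet<>(Arrays.asList("i1", "i2", "i3")));
        for (Map.Entry<String, Double> e : allPairs(system1, system2, 6).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

Note that the ratio measures how far the counts depart from independence in either direction, so a pair that co-occurs much less than chance would also score high; checking that k11 exceeds its expected value is one way to filter those out.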

This is a subset of a recommender problem. I imagine it is nowhere
near big enough to need Hadoop either. It's just a matter of setting
up your data in a file or something and writing about 10 lines of code.
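
One possible way to set up the data, sketched in plain Java. It assumes the target is a comma-separated file of numeric "userID,itemID" lines (the layout Mahout's FileDataModel reads), with each category playing the "user" role. The class name, the row layout {item, system, category}, and the "system:category" key scheme are all hypothetical choices for illustration.

```java
import java.util.*;

final class CategoryFileSketch {

    // Hands out consecutive numeric IDs (1, 2, 3, ...) for string keys,
    // remembering them in the supplied map.
    static long idFor(Map<String, Long> ids, String key) {
        Long id = ids.get(key);
        if (id == null) {
            id = Long.valueOf(ids.size() + 1L);
            ids.put(key, id);
        }
        return id;
    }

    // Each row is {item, system, category}. The category key is prefixed
    // with its system so that "Fantasy" in system 1 and "Fantasy" in
    // system 2 stay distinct "users".
    static List<String> toLines(List<String[]> rows,
                                Map<String, Long> categoryIds,
                                Map<String, Long> itemIds) {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            long itemId = idFor(itemIds, row[0]);
            long categoryId = idFor(categoryIds, row[1] + ":" + row[2]);
            lines.add(categoryId + "," + itemId);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"Harry Potter", "1", "Fiction"},
            new String[] {"Harry Potter", "1", "Fantasy"},
            new String[] {"Harry Potter", "1", "Children"},
            new String[] {"Harry Potter", "2", "Youth"},
            new String[] {"Harry Potter", "2", "Fantasy"});
        Map<String, Long> categoryIds = new LinkedHashMap<>();
        Map<String, Long> itemIds = new LinkedHashMap<>();
        for (String line : toLines(rows, categoryIds, itemIds)) {
            System.out.println(line);
        }
        // Keep this mapping to translate numeric IDs back to names later.
        System.out.println(categoryIds);
    }
}
```

Keeping the two ID maps around matters because the similarity results come back in terms of numeric IDs, and only pairs whose keys come from different systems are of interest for the alignment.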

On Fri, Jul 16, 2010 at 12:26 PM, Chantal Ackermann
<> wrote:
> Hi Sean,
> I wouldn't call it recommendations because the target audience is not
> the end user.
> I would like to do this as a first step to create a mapping between
> those two categorization systems. It's a bit like merging two datasets
> and you would like to know how similar the data in certain (multivalued)
> fields is (say field 1 and field 2).
> This would require pairing each item in field 1 with each item in field
> 2? (Matrix?)
> As a result I would expect something similar to a recommendation system,
> yes. In the sense that when I ask for a value from field 1 I would get
> the values from field 2 that could be seen most equivalent to the input
> value (with some statistical indication if possible).
> I've been rereading the MAHOUT-418 issue (Computing the pairwise
> similarities of the rows of a matrix) and I wonder whether this is what
> I need.
> I've also read through the hadoop word count tutorial and installed
> hadoop (which was as easy as it can be).
> I just don't know where to start as I have not enough experience to
> judge what is relevant for my use case.
> Thanks!
> Chantal
> On Fri, 2010-07-16 at 17:51 +0200, Sean Owen wrote:
>> Let's clarify your situation. You are making recommendations or what?
>> Shouldn't have anything to do with Lucene per se. You do not need Hadoop for
>> recommendations if you don't want. ItemSimilarity is not related to Hadoop.
>> Yes you can define whatever notion of similarity that you like this way. It's
>> up to you, not the framework really. But are you doing recommendations?
>> On Jul 16, 2010 2:01 PM, "Chantal Ackermann" <> wrote:
>> > Hi all,
>> >
>> > my goal is to align two slightly different categorization systems where
>> > each categorized item can have multiple categories in one of these
>> > systems.
>> >
>> > E.g.:
>> > Categorized item: "Harry Potter"
>> > Category system 1: Fiction, Fantasy, Children
>> > Category system 2: Youth, Fantasy
>> > The alignment would then produce a similarity between "Fantasy" (used in
>> > both systems) and "Children" (1) and Youth (2).
>> >
>> > I *think* ItemSimilarity is what I want but if anyone can provide me
>> > with the correct keywords for googling - that would be great.
>> >
>> > If a Lucene/SOLR index is more efficient as source than the lists I have
>> > I'm fine with setting that up. However, I am not sure how the schema
>> > would have to be structured? Would it use the categorized items as
>> > document entities - if not what then?
>> >
>> > Any pointers where to start would be very much appreciated! Also the
>> > information whether I need a full Hadoop installation or whether Mahout
>> > as checked out from trunk is sufficient. It is not very much data
>> > altogether (<10k categorized items).
>> >
>> > Thanks!
>> > Chantal
>> >
>> >
