mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Cooccurrence to align different categorization systems (many to many occurrence)
Date Fri, 16 Jul 2010 17:46:55 GMT
This is a good example of an abstract recommendation task, but I am not sure
that this is the best framework for what you want to do.

The approaches that I would use would start with two item x category
matrices that describe your two categorization systems.  Then what you want
to do is predict cooccurrence (actual or theoretical) between items of
category 1 with items of category 2.
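
As a rough illustration of that starting point, here is a minimal sketch in plain Python (not Mahout code). Only the "Harry Potter" row comes from this thread; the other items, their categories, and the helper name build_matrix are made up for the example. It builds the two binary item x category matrices and counts actual cross-system cooccurrences.

```python
# Toy data: item -> (categories in system 1, categories in system 2).
# Only "Harry Potter" is from the thread; the other rows are hypothetical.
items = {
    "Harry Potter": ({"Fiction", "Fantasy", "Children"}, {"Youth", "Fantasy"}),
    "item_b": ({"Fiction", "Children"}, {"Youth"}),
    "item_c": ({"Fiction", "Fantasy"}, {"Fantasy"}),
    "item_d": ({"Science"}, {"Nonfiction"}),
    "item_e": ({"Science", "Children"}, {"Nonfiction", "Youth"}),
}


def build_matrix(item_cats):
    """Return (sorted category names, binary item x category matrix)."""
    cats = sorted(set().union(*item_cats))
    rows = [[1 if c in s else 0 for c in cats] for s in item_cats]
    return cats, rows


cats1, A1 = build_matrix([s1 for s1, _ in items.values()])
cats2, A2 = build_matrix([s2 for _, s2 in items.values()])

# Cross-system cooccurrence counts, i.e. the matrix product A1' A2:
# C[i][j] = number of items tagged with category i of system 1
# and category j of system 2.
C = [
    [sum(r1[i] * r2[j] for r1, r2 in zip(A1, A2)) for j in range(len(cats2))]
    for i in range(len(cats1))
]

for c1, row in zip(cats1, C):
    print(c1, dict(zip(cats2, row)))
```

Raw counts like these favor frequent categories, which is why a significance measure is usually applied on top of them.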

I see at least three ways that are likely to accomplish this using Mahout.

The simplest is to use the cooccurrence counter and the log-likelihood
ratio measure to find all interesting cooccurring categories.  This will give you
pairs of categories that might or might not be from different categorization
schemes.  You could filter out the uninteresting ones and have your list of
potential pairs.
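
To make the scoring step concrete, here is a small sketch of the log-likelihood ratio (G^2) test on a 2x2 contingency table, written in plain Python rather than with Mahout's classes. The function names, the "1:"/"2:" prefixes used to mark the categorization system, and all data except the "Harry Potter" row are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of cooccurrence counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp rounding noise


def score_pairs(baskets):
    """Score every pair of categories that cooccurs on at least one item."""
    n = len(baskets)
    single = Counter(c for b in baskets for c in b)
    pairs = Counter(p for b in baskets for p in combinations(sorted(b), 2))
    scores = {}
    for (a, b), k11 in pairs.items():
        k12 = single[a] - k11        # items with a but not b
        k21 = single[b] - k11        # items with b but not a
        k22 = n - k11 - k12 - k21    # items with neither
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


# One set of labels per item; the prefix marks the categorization system.
baskets = [
    {"1:Fiction", "1:Fantasy", "1:Children", "2:Youth", "2:Fantasy"},
    {"1:Fiction", "1:Children", "2:Youth"},
    {"1:Fiction", "1:Fantasy", "2:Fantasy"},
    {"1:Science", "2:Nonfiction"},
    {"1:Science", "1:Children", "2:Nonfiction", "2:Youth"},
]

scores = score_pairs(baskets)
# Keep only pairs whose two categories come from different systems.
cross = {p: s for p, s in scores.items() if p[0][0] != p[1][0]}
for pair, s in sorted(cross.items(), key=lambda kv: -kv[1]):
    print(pair, round(s, 3))
```

Note that the ratio is large for any strong association, positive or negative; this sketch only scores pairs that actually cooccur, so a real filter would still compare observed and expected counts.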

Another approach would be to use frequent itemset mining with the same goal
as the first approach.

From there, I would move to latent variable techniques.  The idea is that
you should be able to describe your items and your categories in terms of
internal variables.  Similarity of internal representation should be
something like what you need.  The two major systems that produce latent
variable representations available in Mahout are SVD and LDA.  With SVD, you
define a matrix A that is the column-wise concatenation of the two item x category
matrices I mentioned above.  When you decompose this you will get left and
right singular vectors (U and V) and a diagonal matrix D such that

    A \approx U D V'

Now V will have as many rows as the sum of the numbers of categories of both
types.  You can, in fact, decompose V into parts corresponding to the types
of categories.  This will give you

     [ V_1 ]
V  = [     ]
     [ V_2 ]

You should be able to use the dot product of rows of V_1 versus the rows of
V_2 to get the similarity you want.  You may want to normalize the rows of V
before doing these dot products.  You can do this entire set of dot products
at once using V_1 V_2', but if you have massive numbers of categories that is
probably a bit of overkill.
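
The SVD route can be sketched on a toy problem as follows. This uses NumPy rather than Mahout's SVD, and everything except the first matrix row ("Harry Potter") is invented for the example; the number of latent dimensions k = 2 is likewise an arbitrary assumption.

```python
import numpy as np

# Toy item x category matrices (rows = items). Only the first row
# ("Harry Potter") reflects the thread; the other rows are hypothetical.
cats1 = ["Children", "Fantasy", "Fiction", "Science"]
cats2 = ["Fantasy", "Nonfiction", "Youth"]
A1 = np.array([[1, 1, 1, 0],
               [1, 0, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 1],
               [1, 0, 0, 1]], dtype=float)
A2 = np.array([[1, 0, 1],
               [0, 0, 1],
               [1, 0, 0],
               [0, 1, 0],
               [0, 1, 1]], dtype=float)

# A is the two matrices placed side by side (same items, all categories).
A = np.hstack([A1, A2])

# Thin SVD: A = U diag(d) Vt, singular values in descending order.
U, d, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                   # latent dimensions kept (assumption)
V = Vt[:k].T                            # one row per category
V = V / np.linalg.norm(V, axis=1, keepdims=True)  # normalize rows

# Split V into the blocks belonging to each categorization system.
V1, V2 = V[:len(cats1)], V[len(cats1):]

# All dot products at once: S[i, j] is the cosine similarity between
# category i of system 1 and category j of system 2.
S = V1 @ V2.T

for i, c1 in enumerate(cats1):
    j = int(np.argmax(S[i]))
    print(f"{c1} -> {cats2[j]} ({S[i, j]:.2f})")
```

With many categories, computing the full S is wasteful; scoring only candidate pairs (for example those that cooccur at least once) keeps the work proportional to the pairs of interest.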

Using LDA would be quite similar to this except that the decomposition is
not in terms of a matrix product per se.  You would still come out with the
equivalent of the rows of V for each category and dot products would still
make sense.

On Fri, Jul 16, 2010 at 9:26 AM, Chantal Ackermann <> wrote:

> Hi Sean,
> I wouldn't call it recommendations because the target audience is not
> the end user.
> I would like to do this as a first step to create a mapping between
> those two categorization systems. It's a bit like merging two datasets
> and you would like to know how similar the data in certain (multivalued)
> fields is (say field 1 and field 2).
> This would require pairing each item in field 1 with each item in field
> 2? (Matrix?)
> As a result I would expect something similar to a recommendation system,
> yes. In the sense that when I ask for a value from field 1 I would get
> the values from field 2 that could be seen most equivalent to the input
> value (with some statistical indication if possible).
> I've been rereading the MAHOUT-418 issue (Computing the pairwise
> similarities of the rows of a matrix) and I wonder whether this is what
> I need.
> I've also read through the hadoop word count tutorial and installed
> hadoop (which was as easy as it can be).
> I just don't know where to start as I have not enough experience to
> judge what is relevant for my use case.
> Thanks!
> Chantal
> On Fri, 2010-07-16 at 17:51 +0200, Sean Owen wrote:
> > Let's clarify your situation. You are making recommendations or what?
> > Shouldn't have anything to do with Lucene per se. You do not need Hadoop
> > for recommendations if you don't want. ItemSimilarity is not related to
> > Hadoop. Yes you can define whatever notion of similarity that you like
> > this way. It's up to you not the framework really. But are you doing
> > recommendations?
> >
> > On Jul 16, 2010 2:01 PM, "Chantal Ackermann" <> wrote:
> > > Hi all,
> > >
> > > my goal is to align two slightly different categorization systems where
> > > each categorized item can have multiple categories in one of these
> > > systems.
> > >
> > > E.g.:
> > > Categorized item: "Harry Potter"
> > > Category system 1: Fiction, Fantasy, Children
> > > Category system 2: Youth, Fantasy
> > > The alignment would then produce a similarity between "Fantasy" (used
> > > in both systems) and "Children" (1) and Youth (2).
> > >
> > > I *think* ItemSimilarity is what I want but if anyone can provide me
> > > with the correct keywords for googling - that would be great.
> > >
> > > If a Lucene/SOLR index is more efficient as source than the lists I
> > > have I'm fine with setting that up. However, I am not sure how the schema
> > > would have to be structured? Would it use the categorized items as
> > > document entities - if not what then?
> > >
> > > Any pointers where to start would be very much appreciated! Also the
> > > information whether I need a full Hadoop installation or whether Mahout
> > > as checked out from trunk is sufficient. It is not very much data
> > > altogether (<10k categorized items).
> > >
> > > Thanks!
> > > Chantal
> > >
> > >
