mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: User similarity in Mahout
Date Sun, 03 Jan 2016 17:33:53 GMT
Your problem will be that there isn’t enough cooccurrence between users since, well, how
many jobs can any one user apply for and how likely is another user to apply for the same
or overlapping jobs? The JDs have a short lifetime and so don’t lend themselves to the older
single action recommenders. The cooccurrences you show below are probably optimistic. I know
this from public statements made by CareerBuilder. Not to mention direct experience with a
similar use case. 

I’d expect collaborative filtering based on any one action, like "applying for a job", to
give very poor results for you. CB tried this and got some decent results only for people
with a large number of applications, but this was a small % of cases.

Sooo, their solution was a content based recommender that basically matched resumes to
job descriptions based on content similarity. To get this to work well you may need things
like NLP to get named entities or at least a robust gazetteer that knows a large number of
brand and technology names. There are also parsing services that will extract info from resumes.
This is a long and somewhat complicated path and has little to do with Mahout.

A much simpler path is to use cross-cooccurrence with the newer SimilarityAnalysis.cooccurrence
part of Mahout-Samsara that runs on Spark. It will allow you to use many more user actions,
ones that may give more overlap between user activity. This is collaborative filtering but
can ingest user actions that are different from “apply”, and whose targets are not restricted
to Job Descriptions.
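
To make the idea concrete, here is a minimal sketch of cross-cooccurrence counting in plain Python. It is not Mahout code: the function name and the toy data are made up for illustration, and it only counts raw overlaps. As far as I know SimilarityAnalysis.cooccurrence additionally filters pairs with a log-likelihood ratio test, which this sketch leaves out.

```python
from collections import defaultdict
from itertools import product


def cross_cooccurrence(primary, secondary):
    """Count, for each (primary item, secondary item) pair, the users who did both.

    primary and secondary each map user-id -> set of targets for one action.
    Passing the same mapping twice gives plain cooccurrence (diagonal included).
    """
    counts = defaultdict(int)
    for user, primary_items in primary.items():
        secondary_items = secondary.get(user, set())
        for p, s in product(primary_items, secondary_items):
            counts[(p, s)] += 1
    return dict(counts)


# Hypothetical toy data: "apply" is the primary action, "view" a secondary one.
applies = {"u1": {"J1", "J2"}, "u2": {"J1"}, "u3": {"J7"}}
views = {
    "u1": {"J1", "J2", "J3"},
    "u2": {"J1", "J3"},
    "u3": {"J7", "J3"},
    "u4": {"J3"},  # viewed but never applied, so contributes nothing here
}

apply_view = cross_cooccurrence(applies, views)
print(apply_view[("J1", "J3")])  # users who applied to J1 and viewed J3
```

The rows are always keyed by the primary action ("apply"), so every secondary action ends up described in terms of what it says about applying.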

In this case you have or may be able to collect the following indicators of user preference:

1) user-id, “apply”, job-description-id: from an actual application. Applying is what you want
people to do, so it’s the closest indicator of user preference, assuming you
don’t have information about whether they were accepted for a job, which might be even better.
2) user-id, “view”, job-description-id: from when a user reads the details of a JD
3) user-id, “category-preference”, category-id: again taken when a user “view”s a
JD but the target of the action is the category of the JD, not the JD itself
4) user-id, “job-title-preference”, job-title-token: Take the job title and tokenize it,
then feed in each token (minus stop words) as if they were “tags”. This could be taken
when a user “view”s a JD
5) user-id, “other-JD-meta”, metadata-id: this could be anything about the JD that you
know and is collected for users that “view” the JD. If you have tags, this would be a
good way to use them.

You may also have user profile info taken from their resume, for instance their current job
title. This can be encoded as:
6) user-id, “current-title”, job-title: here it might be necessary to tokenize and feed
each token in unless you have some standardized list of titles. This is taken when a user
enters their information into your app.
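
As a rough sketch of how a single “view” could be turned into several of these indicators, assuming a JD record with hypothetical `id`, `category` and `title` fields and a tiny made-up stop word list:

```python
import re

# Hypothetical, deliberately tiny stop word list; a real one would be larger.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def title_tokens(title):
    """Lowercase a job title and split it into tokens, dropping stop words."""
    tokens = re.findall(r"[a-z0-9+#]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def view_events(user_id, jd):
    """Emit indicators 2-4 as (user-id, action, target) tuples for one view."""
    events = [
        (user_id, "view", jd["id"]),
        (user_id, "category-preference", jd["category"]),
    ]
    events += [
        (user_id, "job-title-preference", token)
        for token in title_tokens(jd["title"])
    ]
    return events


# Hypothetical JD record
jd = {
    "id": "J1",
    "category": "engineering",
    "title": "Senior Developer of the Java Platform",
}
events = view_events("u1", jd)
for event in events:
    print(event)
```

The same tokenizer could be reused for “current-title” in 6) if you have no standardized list of titles.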

The idea is to find many ways that users of your system can have data that is in common with
other users. Then the recommender (I’ll describe next) will use a signal like “job-title-preference”
or “view” even in cases where the user has never applied for a job and so would have none
of the data you mention.
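
A toy illustration of that last point, assuming you already have, for each job, the targets of each action that correlate with applying to it. The data structure and the simple overlap-count scoring are assumptions for illustration only, not how any particular recommender implements it:

```python
def score_jobs(user_history, indicators):
    """Rank jobs by how many of their indicator entries appear in the user's history.

    user_history: action -> set of targets the user has touched.
    indicators: job-id -> action -> set of targets correlated with applying to that job.
    """
    scores = {}
    for job, by_action in indicators.items():
        score = sum(
            len(targets & user_history.get(action, set()))
            for action, targets in by_action.items()
        )
        if score > 0:
            scores[job] = score
    # Highest score first, job-id as a tie breaker
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical indicator data
indicators = {
    "J1": {"apply": {"J2"}, "view": {"J3", "J4"}, "job-title-preference": {"java"}},
    "J7": {"apply": {"J8"}, "view": {"J9"}, "job-title-preference": {"nurse"}},
}

# A user who has never applied for anything
new_user = {"view": {"J3"}, "job-title-preference": {"java", "developer"}}

ranked = score_jobs(new_user, indicators)
print(ranked)
```

The user has no “apply” history at all, yet J1 still gets a score from the “view” and “job-title-preference” signals.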

As far as I know the only end-to-end, mostly off-the-shelf, implementation of this that uses
Mahout is the Universal Recommender here:
It is built on the PredictionIO Framework described here:
It supports any number of the “secondary” indicators—things like #2-#6, and is integrated
with an event store and recommendation server. The Mahout docs for the command line version
of cooccurrence analysis are here (in case you want to build your own framework):

I seriously doubt the older Mahout Hadoop-based recommenders will help since they can only
use one indicator.

> On Jan 3, 2016, at 7:01 AM, Peter K <> wrote:
> Hi all,
> I'm trying to implement a recommender based 
> on Mahout to recommend jobs for users. 
> There are 2 actions - an user applied for a job or 
> viewed a job. In terms of weight I'm using 5 for 
> an apply and 2 for a view.
> Now I'm trying to find best user similarity to capture 
> these relations.
> For example:
> User1 applied to jobs: J1,J2,J3,J4,J5
> User2 applied to jobs: J1,J2,J3,J4,J6
> User3 applied to jobs: J1, J7
> When using Euclidean distance similarity if I'm not mistaken 
> users 2 and 3 are equal (when 
> calculating similarity to User1). But I feel User2 is more similar 
> and thus J6 should be 
> higher in the recommendations than J7.
> Generally, I'm looking into more suggestions what algorithms 
> might be the best for this 
> case.
> Thank you very much for any suggestions.
> P.
