mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: Setting up a recommender
Date Mon, 21 Apr 2014 20:00:36 GMT
Yes, the cooccurrence item similarity matrix is calculated with LLR using Mahout’s RowSimilarityJob.
I guess we are calling this an indicator matrix these days.
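For reference, the LLR score RowSimilarityJob applies can be sketched from the 2x2 contingency counts for an item pair. This is a minimal Python sketch of the same G² formula that Mahout's LogLikelihood class implements; the function names here are illustrative, not Mahout's API:

```python
import math

def xlogx(x):
    # x * log(x), with the 0 * log(0) = 0 convention
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table:
    k11 = users who acted on both items
    k12 = users who acted on item A only
    k21 = users who acted on item B only
    k22 = users who acted on neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * max(0.0, row_entropy + col_entropy - mat_entropy)
```

Independent counts score near zero while strong cooccurrence scores high, which is why LLR filtering keeps only the "anomalously" cooccurring items as indicators.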

The indicator matrix is then translated from a SequenceFile into a CSV (or other text delimited
file) which looks like a list of itemIDs—tokens or terms in Solr parlance—for each item.
These documents are indexed by Solr and the query is the user history.
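The translation step can be sketched roughly like this, assuming the indicator matrix has already been read out of the SequenceFile into memory; the field names are illustrative, not the project's actual schema:

```python
import csv
import io

def indicators_to_csv(indicators):
    """Write one Solr doc per item: an id plus a space-delimited
    field of indicator itemIDs, which Solr will index as tokens.
    `indicators` maps itemID -> list of similar itemIDs."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "indicators"])
    for item_id, similar in indicators.items():
        writer.writerow([item_id, " ".join(similar)])
    return out.getvalue()

# Hypothetical row of an indicator matrix
csv_text = indicators_to_csv({"iphone": ["ipad", "galaxy"]})
```

Querying that indexed field with a user's history string is then an ordinary Solr text query.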

[B’B] is pre-calculated by RowSimilarityJob in Mahout. The user history is “multiplied”
by the indicator matrix by using it as the Solr query against the indexed indicator matrix,
which actually produces a cosine-similarity-ranked list of items.

You have to squint a little to see the math. Any matrix product can be substituted with a
row-to-column similarity metric, assuming the dimensionality is correct, so the product in all
the equations should be interpreted that way. To get recs for a user, [B’B]h is done in
two phases: one calculates [B’B] and the other is a Solr query that adds the ‘h’ to the equation.
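A toy worked example of the two phases, with made-up data (and a literal matrix multiply standing in for the Solr query):

```python
# Toy user-by-item matrix B: rows = users, columns = items;
# a 1 means the user took the action on that item.
B = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1],
     [1, 0, 0]]

n_items = len(B[0])

# Phase 1 (offline): the cooccurrence matrix [B'B]. Entry (i, j)
# counts users who acted on both i and j; RowSimilarityJob
# additionally filters these counts with LLR.
BtB = [[sum(row[i] * row[j] for row in B) for j in range(n_items)]
       for i in range(n_items)]

# Phase 2 (query time): the user history h adds the 'h' to [B'B]h.
# Solr replaces this multiply with a similarity-ranked query
# over the indexed indicator docs.
h = [1, 0, 0]  # the user has acted on item 0 only
scores = [sum(w * x for w, x in zip(row, h)) for row in BtB]
print(scores)  # [3, 2, 1]
```

Item 0 scores highest here only because it is the queried item itself; in practice the user's own history items are filtered out of the recs.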

In this project both [B’B] and [A’B] are calculated;
the latter uses an actual matrix multiply, since we did not have a cross-RSJ at the time. Now
that we have cross-cooccurrence in the Spark Scala Mahout 2 work I’ll rewrite the code
to use it.

The cross indicator matrix allows you to use one action to predict a different target action.
For example, views that are similar to purchases can be used to recommend purchases. Take
a look at the readme on GitHub; it has a quick review of the theory.
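A toy sketch of the cross case, with made-up data, using the [B'A] orientation from the 2013 thread below (B = purchases, A = views, same users in both):

```python
# Toy matrices: B = user x purchased-item, A = user x viewed-item.
B = [[1, 0],
     [1, 1],
     [0, 1]]
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]

n_users = len(B)

# Cross-cooccurrence [B'A]: entry (i, j) counts users who purchased
# item i and viewed item j (LLR-filtered in the real pipeline).
BtA = [[sum(B[u][i] * A[u][j] for u in range(n_users))
        for j in range(len(A[0]))]
       for i in range(len(B[0]))]

# Recommend purchases from a view-only history: scores = [B'A]h
h_views = [0, 1, 0]  # the user has only viewed item 1
scores = [sum(w * x for w, x in zip(row, h_views)) for row in BtA]
print(scores)  # [2, 1]: purchase item 0 outranks item 1
```

So even with no purchase history at all, the view history produces a ranking over purchasable items, which is the point of the cross recommendation.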

BTW there is a video recommender site that demos some interesting uses of Solr to blend collaborative
filtering recs with metadata. It even makes recs based off of your most recent detail views
on the site. That last feature doesn’t work all that well because it is really a cross-recommendation
and that isn’t built into the site yet.

On Apr 21, 2014, at 12:11 PM, Frank Scholten <> wrote:

Pat and Ted: I am late to the party but this is very interesting!

I am not sure I understand all the steps, though. Do you still create a
cooccurrence matrix and compute LLR scores during this process, or do you
only compute the matrix products with the history vector: B'B * h and
B'A * h?



On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel <> wrote:

> I finally got some time to work on this and have a first cut at output to
> Solr working on the github repo. It only works on 2-action input but I'll
> have that cleaned up soon so it will work with one action. Solr indexing
> has not been tested yet and the field names and/or types may need tweaking.
> It takes the result of the previous drop:
> 1) DRMs for B (user history or B items action1) and A (user history of A
> items action2)
> 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
> There are two final outputs created using mapreduce but requiring 2
> in-memory hashmaps. I think this will work on a cluster (the hashmaps are
> instantiated on each node) but haven't tried yet. It orders items in #2
> fields by strength of "link", which is the similarity value used in [B'B]
> or [B'A]. It would be nice to order #1 by recency but there is no provision
> for passing through timestamps at present so they are ordered by the
> strength of preference. This is probably not useful and so can be ignored.
> Ordering by recency might be useful for truncating queries by recency while
> leaving the training data containing 100% of available history.
> 1) It joins #1 DRMs to produce a single set of docs in CSV form, which
> looks like this:
> id,history_b,history_a
> user1,iphone ipad,iphone ipad galaxy
> ...
> 2) it joins #2 DRMs to produce a single set of docs in CSV form, which
> looks like this:
> id,b_b_links,b_a_links
> u1,iphone ipad,iphone ipad galaxy
> …
> It may work on a cluster, I haven't tried yet. As soon as someone has some
> large-ish sample log files I'll give them a try. Check the sample input
> files in the resources dir for format.
> On Aug 13, 2013, at 10:17 AM, Pat Ferrel <> wrote:
> When I started looking at this I was a bit skeptical. As a search engine
> Solr may be peerless, but as yet another NoSQL db?
> However getting further into this I see one very large benefit. It has one
> feature that sets it completely apart from the typical NoSQL db: the type
> of query you do returns fuzzy results--in the very best sense of that
> word. The most interesting queries are based on similarity to some
> exemplar. Results are returned in order of similarity strength, not ordered
> by a sort field.
> Wherever similarity based queries are important I'll look at Solr first.
> SolrJ looks like an interesting way to get Solr queries on POJOs. It's
> probably at least an alternative to using docs and CSVs to import the data
> from Mahout.
> On Aug 12, 2013, at 2:32 PM, Ted Dunning <> wrote:
> Yes.  That would be interesting.
> On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan <> wrote:
>> A little digression: Might a Matrix implementation backed by a Solr index
>> and uses SolrJ for querying help at all for the Solr recommendation
>> approach?
>> It supports multiple fields of String, Text, or boolean flags.
>> Best
>> Gokhan
>> On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel <> wrote:
>>> Also a question about user history.
>>> I was planning to write these into separate directories so Solr could
>>> fetch them from different sources but it occurs to me that it would be
>>> better to join A and B by user ID and output a doc per user ID with
> three
>>> fields, id, A item history, and B item history. Other fields could be
>> added
>>> for users metadata.
>>> Sound correct? This is what I'll do unless someone stops me.
>>> On Aug 7, 2013, at 11:25 AM, Pat Ferrel <> wrote:
>>> Once you have a sample or example of what you think the
>>> "log file" version will look like, can you post it? It would be great to
>>> have example lines for two actions with or without the same item IDs.
>> I'll
>>> make sure we can digest it.
>>> I thought more about the ingest part and I don't think the
> one-item-space
>>> is actually a problem. It just means one item dictionary. A and B will
>> have
>>> the right content, all I have to do is make sure the right ranks are
>> input
>>> to the MM,
>>> Transpose, and RSJ. This in turn is only one extra count of the # of
>> items
>>> in A's item space. This should be a very easy change if my thinking is
>>> correct.
>>> On Aug 7, 2013, at 8:09 AM, Ted Dunning <> wrote:
>>> On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel <>
> wrote:
>>>> 4) To add more metadata to the Solr output will be left to the consumer
>>>> for now. If there is a good data set to use we can illustrate how to do
>>> it
>>>> in the project. Ted may have some data for this from musicbrainz.
>>> I am working on this issue now.
>>> The current state is that I can bring in a bunch of track names and
> links
>>> to artist names and so on.  This would provide the basic set of items
>>> (artists, genres, tracks and tags).
>>> There is a hitch in bringing in the data needed to generate the logs
>> since
>>> that part of MB is not Apache compatible.  I am working on that issue.
>>> Technically, the data is in a massively normalized relational form right
>>> now, but it isn't terribly hard to denormalize into a form that we need.
