mahout-user mailing list archives

From Gokhan Capan <>
Subject Re: Setting up a recommender
Date Mon, 22 Jul 2013 18:56:26 GMT
Just to make sure I understood correctly, Ted, could you please correct me
if I'm wrong:
1. Using a search engine, I will treat items as documents, where each
document vector consists of other items (analogous to the words of a
document) with co-occurrence (LLR) weights instead of the tf-idf weights a
search engine would normally use. So for each item I will have a sparse
vector that represents the relation of that item to other items, with a
non-zero entry wherever an indicator makes the item-to-item co-occurrence
significant. (I will only use positive feedback, I think, since I am
counting co-occurrences.)
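A minimal sketch of the LLR weighting in step 1, using Dunning's G² statistic over the 2x2 co-occurrence contingency table (the function names are mine, not from any library):

```python
import math

def x_log_x(x):
    # x * log(x), with the convention 0 * log(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table:
    # k11 = both items seen together, k12/k21 = one without the other,
    # k22 = neither item seen
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    if row_entropy + col_entropy < mat_entropy:
        return 0.0  # guard against round-off
    return 2.0 * (row_entropy + col_entropy - mat_entropy)
```

Independent counts score near zero (e.g. `llr(5, 5, 5, 5)`), while strong co-occurrence (e.g. `llr(10, 0, 0, 10)`) scores high, which is what makes it usable as a sparsifying weight.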

2. To produce recommendations, the system formulates a "query" from a
history of items -- the session history for task-based recommendation, or a
long-term history. The search engine then finds the top-N items, based on
the cosine similarities between the item (document) vectors and the history
(query) vector.
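Step 2 could be sketched like this, treating each item's indicator vector as a sparse dict and the history as a bag-of-items query (this is an in-memory illustration of the scoring, not how Lucene actually executes it):

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse vectors (dict: item -> weight)
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(history, index, n=10):
    # history: list of item ids; index: item id -> sparse indicator vector.
    # The query is just the history items with uniform weight.
    query = {item: 1.0 for item in history}
    scored = [(cosine(index[i], query), i) for i in index if i not in history]
    return [i for score, i in sorted(scored, reverse=True)[:n]]
```

Weighting the query terms (e.g. decaying older history items) would slot in where the uniform `1.0` is.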

3. For example, if this were a restaurant recommender and we knew that a
restaurant was famous for its sushi, I would index that in another field,
"famous_for".
Now if, as a user, I asked for sushi restaurants that I would enjoy, the
system would add this to the query along with my history, and the famous
sushi restaurant would rank higher in the results, even if, according to
the computation in 2, the chances that I would like a steakhouse were equal.
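Combining the history with a content field as in step 3 amounts to building a multi-clause query; a sketch in Lucene-style query syntax (the field names "indicators" and "famous_for" and the boost values are illustrative assumptions, not from the thread):

```python
def build_query(history, famous_for=None, history_boost=1.0, content_boost=1.0):
    # Build a Lucene/Solr-style OR query string: one clause per history
    # item against the co-occurrence field, plus an optional content clause.
    clauses = ["indicators:%s^%.1f" % (item, history_boost) for item in history]
    if famous_for:
        clauses.append("famous_for:%s^%.1f" % (famous_for, content_boost))
    return " OR ".join(clauses)
```

So a sushi request on top of a two-item history would yield something like `indicators:item1^1.0 OR indicators:item2^1.0 OR famous_for:sushi^2.0`.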

4. Since this is a search engine, and a search engine can boost a
particular field, the system could let "famous_for" outweigh the
collaborative activity, or the opposite, depending on the use case (or, for
example, on the number of items in the history). So I can define a
weighting (voting, or mixture-of-experts) scheme to "blend" different
recommenders.
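The blending in step 4 could be sketched as a simple weighted linear combination of per-recommender score lists (a minimal mixture-of-experts, assuming each recommender returns item -> score dicts):

```python
def blend(score_lists, weights):
    # score_lists: one dict (item -> score) per recommender;
    # weights: matching per-recommender blend weights.
    # Returns item ids ranked by the weighted sum of scores.
    combined = {}
    for scores, w in zip(score_lists, weights):
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + w * score
    return sorted(combined, key=combined.get, reverse=True)
```

In a search engine the same effect comes for free from per-field boosts; this version is only useful when blending engines that can't share one index.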

Are those correct?


On Mon, Jul 22, 2013 at 9:07 PM, Michael Sokolov <> wrote:

> On 07/22/2013 12:20 PM, Pat Ferrel wrote:
>> My understanding of the Solr proposal puts B's row similarity matrix in a
>> vector per item. That means each row is turned into "terms" = external
>> IDs--not sure how the weights of each term are encoded.
> This is the key question for me. The best idea I've had is to use termFreq
> as a proxy for weight.  It's only an integer, so there are scaling issues
> to consider, but you can apply a per-field weight to manage that.  Also,
> Lucene (and Solr) doesn't provide an obvious way to load term frequencies
> directly: probably the simplest thing to do is just to repeat the
> cross-term N times and let the text analysis take care of counting them.
>  Inefficient, but probably the quickest way to get going.  Alternatively,
> there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
> which I haven't really plumbed entirely, but would allow for more direct
> loading of fields.
> Then one probably wants to override the scoring in some way (unless TFIDF
> is the way to go somehow??)
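The repeat-the-term trick Michael describes (encoding a float weight as an integer term frequency) could be sketched as follows; the scale factor is the precision/index-size trade-off he mentions, and the function name is mine:

```python
def weight_to_field_text(indicators, scale=10):
    # indicators: dict of item id -> LLR weight.
    # Encode each weight as term frequency by repeating the item id
    # round(weight * scale) times (at least once), so that standard
    # text analysis counts it back into the index.
    tokens = []
    for item, weight in indicators.items():
        tokens.extend([item] * max(1, round(weight * scale)))
    return " ".join(tokens)
```

The resulting string is what would be fed into the indexed field, letting the analyzer's term counting recover an approximation of the original weights.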
