mahout-user mailing list archives

From Frank Scholten <fr...@frankscholten.nl>
Subject Re: Setting up a recommender
Date Mon, 21 Apr 2014 19:11:44 GMT
Pat and Ted: I am late to the party but this is very interesting!

I am not sure I understand all the steps, though. Do you still create a
cooccurrence matrix and compute LLR scores during this process or do you
only compute matrix multiplication times the history vector: B'B * h and
B'A * h?
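
To make the question concrete, here is a minimal sketch of the two products as plain cooccurrence counts, with no LLR applied. Everything in it is an assumption for illustration: the tiny dense matrices, the item order (iphone, ipad, galaxy), the helper names, and the use of separate history vectors h_b and h_a for the two actions.

```python
# Rows are users, columns are items (iphone, ipad, galaxy). Toy data only.
B = [[1, 1, 0],   # action1 (primary) history per user
     [1, 0, 1],
     [0, 1, 1]]
A = [[1, 1, 1],   # action2 (secondary) history per user
     [1, 0, 0],
     [0, 1, 1]]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(x, y):
    # Plain dense matrix product; fine for a toy example.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

BtB = matmul(transpose(B), B)   # B-item x B-item cooccurrence counts
BtA = matmul(transpose(B), A)   # B-item x A-item cross-cooccurrence counts

h_b = [1, 0, 0]   # hypothetical user history for action1: iphone
h_a = [0, 0, 1]   # hypothetical user history for action2: galaxy

scores_b = matvec(BtB, h_b)     # B'B * h
scores_a = matvec(BtA, h_a)     # B'A * h
combined = [x + y for x, y in zip(scores_b, scores_a)]
```

Here the two score vectors are simply added; how (or whether) they are weighted against each other is a separate design choice.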

Cheers,

Frank


On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:

> I finally got some time to work on this and have a first cut at output to
> Solr working on the github repo. It only works on 2-action input but I'll
> have that cleaned up soon so it will work with one action. Solr indexing
> has not been tested yet and the field names and/or types may need tweaking.
>
> It takes the result of the previous drop:
> 1) DRMs for B (user history of B items, action1) and A (user history of A
> items, action2)
> 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
>
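
As a rough illustration of the LLR scoring mentioned in item 2, here is a minimal sketch of the log-likelihood ratio for a 2x2 cooccurrence contingency table, written in the usual entropy form. The function names and the example counts are assumptions for illustration and are not taken from the repo.

```python
import math

def x_log_x(x):
    # x * ln(x), with the convention 0 * ln(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users who interacted with both items
    k12: users who interacted with the first item but not the second
    k21: users who interacted with the second item but not the first
    k22: users who interacted with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))
```

Counts that look independent, such as llr(10, 10, 10, 10), score about zero, while strongly associated counts such as llr(100, 5, 5, 1000) score high.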
> There are two final outputs created using mapreduce but requiring 2
> in-memory hashmaps. I think this will work on a cluster (the hashmaps are
> instantiated on each node) but haven't tried yet. It orders items in #2
> fields by strength of "link", which is the similarity value used in [B'B]
> or [B'A]. It would be nice to order #1 by recency but there is no provision
> for passing through timestamps at present so they are ordered by the
> strength of preference. This is probably not useful and so can be ignored.
> Ordering by recency might be useful for truncating queries by recency while
> leaving the training data containing 100% of available history.
>
> 1) It joins #1 DRMs to produce a single set of docs in CSV form, which
> looks like this:
> id,history_b,history_a
> user1,iphone ipad,iphone ipad galaxy
> ...
>
> 2) It joins #2 DRMs to produce a single set of docs in CSV form, which
> looks like this:
> id,b_b_links,b_a_links
> u1,iphone ipad,iphone ipad galaxy
> …
>
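
The join described in the two steps above can be sketched as follows: two per-ID item lists are merged into one CSV doc per ID, with each field holding space-separated item IDs. This is only a single-process sketch under stated assumptions; the function name, the dict-based inputs, and the sorted ID order are hypothetical and do not reflect the mapreduce code in the repo.

```python
import csv
import io

def join_to_csv(first, second, field_names):
    """Join two {id: [item, ...]} dicts into CSV text, one doc per id.

    field_names holds the two column names, for example
    ("history_b", "history_a") or ("b_b_links", "b_a_links").
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", field_names[0], field_names[1]])
    # Outer join: keep ids that appear in either input.
    for doc_id in sorted(set(first) | set(second)):
        writer.writerow([
            doc_id,
            " ".join(first.get(doc_id, [])),
            " ".join(second.get(doc_id, [])),
        ])
    return buf.getvalue()

docs = join_to_csv(
    {"user1": ["iphone", "ipad"]},
    {"user1": ["iphone", "ipad", "galaxy"]},
    ("history_b", "history_a"),
)
```

The item lists are assumed to be already ordered (by link strength or preference strength) before the join, so the writer preserves whatever order it is given.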
> It may work on a cluster, I haven't tried yet. As soon as someone has some
> large-ish sample log files I'll give them a try. Check the sample input
> files in the resources dir for format.
>
> https://github.com/pferrel/solr-recommender
>
>
> On Aug 13, 2013, at 10:17 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>
> When I started looking at this I was a bit skeptical. As a Search engine
> Solr may be peerless, but as yet another NoSQL db?
>
> However getting further into this I see one very large benefit. It has one
> feature that sets it completely apart from the typical NoSQL db. The type
> of queries you do return fuzzy results--in the very best sense of that
> word. The most interesting queries are based on similarity to some
> exemplar. Results are returned in order of similarity strength, not ordered
> by a sort field.
>
> Wherever similarity based queries are important I'll look at Solr first.
> SolrJ looks like an interesting way to get Solr queries on POJOs. It's
> probably at least an alternative to using docs and CSVs to import the data
> from Mahout.
>
>
>
> On Aug 12, 2013, at 2:32 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> Yes.  That would be interesting.
>
>
>
>
> On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan <gkhncpn@gmail.com> wrote:
>
> > A little digression: Might a Matrix implementation that is backed by a
> > Solr index and uses SolrJ for querying help at all with the Solr
> > recommendation approach?
> >
> > It supports multiple fields of String, Text, or boolean flags.
> >
> > Best
> > Gokhan
> >
> >
> > On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> >
> >> Also a question about user history.
> >>
> >> I was planning to write these into separate directories so Solr could
> >> fetch them from different sources but it occurs to me that it would be
> >> better to join A and B by user ID and output a doc per user ID with
> >> three fields: id, A item history, and B item history. Other fields could
> >> be added for user metadata.
> >>
> >> Sound correct? This is what I'll do unless someone stops me.
> >>
> >> On Aug 7, 2013, at 11:25 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
> >>
> >> Once you have a sample or example of what you think the
> >> "log file" version will look like, can you post it? It would be great to
> >> have example lines for two actions with or without the same item IDs.
> >> I'll make sure we can digest it.
> >>
> >> I thought more about the ingest part and I don't think the
> >> one-item-space is actually a problem. It just means one item dictionary.
> >> A and B will have the right content; all I have to do is make sure the
> >> right ranks are input to the MM, Transpose, and RSJ. This in turn is only
> >> one extra count of the # of items in A's item space. This should be a
> >> very easy change if my thinking is correct.
> >>
> >>
> >> On Aug 7, 2013, at 8:09 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>
> >> On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> >>
> >>> 4) To add more metadata to the Solr output will be left to the consumer
> >>> for now. If there is a good data set to use we can illustrate how to do
> >>> it in the project. Ted may have some data for this from musicbrainz.
> >>
> >>
> >> I am working on this issue now.
> >>
> >> The current state is that I can bring in a bunch of track names and
> >> links to artist names and so on.  This would provide the basic set of
> >> items (artists, genres, tracks and tags).
> >>
> >> There is a hitch in bringing in the data needed to generate the logs
> >> since that part of MB is not Apache compatible.  I am working on that
> >> issue.
> >>
> >> Technically, the data is in a massively normalized relational form right
> >> now, but it isn't terribly hard to denormalize into a form that we need.
> >>
> >>
> >>
> >
>
>
>
>
