I have a related question about the indicator matrix. Is it possible to
compute it from quantitative ratings, or perhaps from only the good ratings
treated as a single action (user 1 "liked" product 1)? I am referring to
"Practical Machine Learning: Innovations in Recommendation", where you say
that "The best choice of data may surprise you—it’s not user ratings [...]".
So basically, was this recommender designed specifically not to use
quantitative ratings, or is it just an empirical observation that visits
work better than ratings for producing an indicator matrix that leads
to the best recommendations?
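
To make the "single action" case concrete, here is a minimal sketch of how I
understand an indicator matrix could be built from binary "liked" events using
the log-likelihood ratio (LLR). This is an illustration only, not Mahout's
code: the `likes` input, the `indicators` function and its `max_per_item`
parameter are names I made up for the example.

```python
from itertools import combinations
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table:
    # k11 = users with both items, k12 = first item only,
    # k21 = second item only,      k22 = neither item.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative rounding errors

def indicators(likes, max_per_item=100):
    # likes: dict user -> set of items that user "liked" (hypothetical input).
    users_of = {}
    for user, items in likes.items():
        for item in items:
            users_of.setdefault(item, set()).add(user)
    n_users = len(likes)
    scores = {item: [] for item in users_of}
    for a, b in combinations(sorted(users_of), 2):
        k11 = len(users_of[a] & users_of[b])
        if k11 == 0:
            continue  # the two items never cooccur, so there is nothing to score
        k12 = len(users_of[a]) - k11
        k21 = len(users_of[b]) - k11
        k22 = n_users - k11 - k12 - k21
        score = llr(k11, k12, k21, k22)
        scores[a].append((score, b))
        scores[b].append((score, a))
    # Keep only the highest-scoring indicators for each item.
    return {item: [other for _, other in sorted(pairs, reverse=True)[:max_per_item]]
            for item, pairs in scores.items()}
```

Only the fact that an interaction happened enters the counts, so a rating
value would have to be turned into a yes/no event (for example "rated 4 or
higher") before it could be used this way.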
On Thu, May 29, 2014 at 4:54 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> I really think that a hard limit on number of indicators is just fine. The
> points that I have seen raised regarding this include:
>
> a) this doesn't limit total size of indicator matrix.
>
> I agree with this. It doesn't. And it shouldn't. It does limit that size
> per item which is really better for operational use.
>
> b) an average would be better
>
> Why? The hard limit winds up limiting almost all items to exactly the
> limit. This means that this is very nearly the average.
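
A small sketch of that last point, on made-up data: when most items have more
candidates than the limit, capping each row puts the average row length very
close to the limit itself. The function names and the synthetic rows are
assumptions for illustration, not anything taken from Mahout.

```python
def cap_per_item(rows, limit):
    # Keep at most `limit` highest-scoring (score, item) pairs in each row.
    return {item: sorted(pairs, reverse=True)[:limit] for item, pairs in rows.items()}

def average_row_length(rows):
    return sum(len(pairs) for pairs in rows.values()) / len(rows)

# Synthetic candidates: item i has 10 * i scored candidates, for i = 1..100.
rows = {
    f"item{i}": [(float(j), f"other{j}") for j in range(10 * i)]
    for i in range(1, 101)
}

capped = cap_per_item(rows, limit=50)
at_limit = sum(1 for pairs in capped.values() if len(pairs) == 50)

print(at_limit)                    # 96 of the 100 items sit exactly at the limit
print(average_row_length(capped))  # 49.0, just under the limit of 50
```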
>
>
>
>
> On Wed, May 28, 2014 at 8:31 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
> > That’s what I thought, also why the total number of indicators is not
> > limitable, right?
> >
> > For the Spark version, should we allow something like an average number of
> > indicators per item? We will only be supporting LLR with that and, as Ted
> > and Ken point out, that is the interesting thing to limit. It will mean a
> > nontrivial bit of added processing if specified, obviously.
> >
> > On May 27, 2014, at 12:00 PM, Sebastian Schelter <ssc@apache.org> wrote:
> >
> > I have added the threshold merely as a way to increase the performance of
> > RowSimilarityJob. If a threshold is given, some item pairs don't need to be
> > looked at. A simple example: if you use cooccurrence count as the similarity
> > measure and set a threshold of n cooccurrences, then any pair containing
> > an item with fewer than n interactions can be ignored. IIRC similar
> > techniques are implemented for cosine and Jaccard.
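
The cooccurrence example can be sketched as follows. The cooccurrence count of
two items can never exceed the interaction count of either item, so items with
fewer than n interactions can be dropped before any pairs are formed. This is
a plain illustration of that bound, not the RowSimilarityJob implementation;
`users_of` and `cooccurrence_pairs` are hypothetical names.

```python
from itertools import combinations

def cooccurrence_pairs(users_of, threshold):
    # users_of: dict item -> set of users who interacted with that item.
    # An item with fewer than `threshold` interactions cannot reach the
    # threshold with any partner, so it is pruned up front.
    candidates = [item for item in sorted(users_of)
                  if len(users_of[item]) >= threshold]
    result = {}
    for a, b in combinations(candidates, 2):
        count = len(users_of[a] & users_of[b])
        if count >= threshold:
            result[(a, b)] = count
    return result

users_of = {"a": {1, 2, 3}, "b": {1, 2, 3, 4}, "c": {1}, "d": {2, 3}}
pairs = cooccurrence_pairs(users_of, threshold=2)
# Item "c" has a single interaction, so no pair containing it is ever examined.
```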
> >
> > Best,
> > Sebastian
> >
> >
> >
> > On 05/27/2014 07:08 PM, Pat Ferrel wrote:
> > >>
> > >> On May 27, 2014, at 8:15 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >>
> > >> The threshold should not normally be used in the Mahout+Solr deployment
> > >> style.
> > >
> > > Understood, and that’s why an alternative way of specifying a cutoff may
> > > be a good idea.
> > >
> > >>
> > >> This need is better supported by specifying the maximum number of
> > >> indicators. This is mathematically equivalent to specifying a fraction of
> > >> values, but is more meaningful to users since good values for this number
> > >> are pretty consistent across different uses (50-100 are reasonable values
> > >> for most needs; larger values are quite plausible).
> > >
> > > I assume you mean 50-100 as the average number per item.
> > >
> > > The total for the entire indicator matrix is what Ken was asking for.
> > > But I was thinking about the use with itemsimilarity, where the user may not
> > > know the dimensionality, since itemsimilarity assembles the matrix from
> > > individual prefs. The user probably knows the number of items in their
> > > catalog, but the indicator matrix dimensionality is arbitrarily smaller.
> > >
> > > Currently the help reads:
> > > --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem    try to cap the
> > > number of similar items per item to this number (default: 100)
> > >
> > > If this were actually the average # per item it would do what you
> > > describe, but it looks like it’s a literal cutoff per vector in the code.
> > >
> > > A cutoff based on the highest scores in the entire matrix seems to imply
> > > a sort when the total is larger than the average would allow, and I don’t
> > > see an obvious sort being done in the MR.
> > >
> > > Anyway, it looks like we could do this by
> > > 1) total number of values in the matrix (what Ken was asking for). This
> > > requires that the user know the dimensionality of the indicator matrix to
> > > be very useful.
> > > 2) average number per item (what Ted describes). This seems the most
> > > intuitive and does not require that the dimensionality be known.
> > > 3) fraction of the values. This might be useful if you are more
> > > interested in downsampling by score; at least it seems more useful than
> > > --threshold as it is today, but maybe I’m missing some use cases? Is there
> > > really a need for a hard score threshold?
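
A rough sketch of the three options as whole-matrix cutoffs, assuming the
matrix is available as a flat list of (score, row item, column item) entries.
All function names here are hypothetical, and reading option 2 as "a total
budget of average times the number of rows" is my interpretation rather than
something stated in this thread. Each variant needs a selection over the
entire matrix, which is the global sort mentioned above.

```python
import heapq

def keep_top_total(entries, total):
    # Option 1: keep the `total` highest-scoring entries of the whole matrix.
    return heapq.nlargest(total, entries)

def keep_average_per_item(entries, average):
    # Option 2 (as interpreted here): budget = average * number of rows,
    # spent on the globally highest scores rather than row by row.
    n_rows = len({row for _, row, _ in entries})
    return keep_top_total(entries, average * n_rows)

def keep_top_fraction(entries, fraction):
    # Option 3: keep the given fraction of all entries, highest scores first.
    return keep_top_total(entries, int(len(entries) * fraction))

entries = [
    (9.0, "x", "a"), (8.0, "x", "b"), (7.0, "x", "c"),
    (1.0, "y", "a"), (0.5, "y", "b"),
    (0.2, "z", "a"),
]
kept = keep_top_fraction(entries, 0.5)
# All three surviving entries belong to row "x"; rows "y" and "z" keep nothing.
```

In this toy data every global variant keeps the strongest similarities, but
some items end up with no indicators at all, whereas a per-item cap of 1 would
keep one entry for each of x, y and z and drop two of the strong x entries.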
> > >
> > >
> > >>
> > >>
> > >> On Tue, May 27, 2014 at 8:08 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> > >>
> > >>> I was talking with Ken Krugler off list about the Mahout + Solr
> > >>> recommender and he had an interesting request.
> > >>>
> > >>> When calculating the indicator/item similarity matrix using
> > >>> ItemSimilarityJob there is a threshold option. Wouldn’t it be better to
> > >>> have an option that specified the fraction of values kept in the entire
> > >>> matrix based on their similarity strength? This is very difficult to do
> > >>> with threshold. It would be like expressing the threshold as a fraction
> > >>> of the total number of values rather than a strength value. Seems like this
> > >>> would have the effect of tossing the least interesting similarities, where
> > >>> limiting per item (--maxSimilaritiesPerItem) could easily toss some of the
> > >>> most interesting.
> > >>>
> > >>> At the very least it seems like a better way of expressing the threshold,
> > >>> doesn’t it?
> > >>
> >
> >
> >
>
