Hey Sebastian,

Thanks for all the information, that was very helpful.

One question: when you said "as large as the maximum number of interactions per user or larger" for the maxPrefsPerUser property, does that mean that if the algorithm were comparing your items and my items, it would look at how many of our items we have in common? So if we each have 500 items in total, 100 of which are common to both sets, would maxPrefsPerUser cap how many of those 100 common items it looks at?

You said this should be set to a very high number; however, in the code this value appears to default to 10. What do you consider very high: 100, 1,000, 10,000?

We are currently using 0.7 because that is what came with our version of HDP. Should we upgrade to trunk or 0.8? Would we gain any performance or algorithm improvements?

Thanks so much!
Brian

On Thu, Sep 12, 2013 at 11:01 AM, Sebastian Schelter wrote:

> Hi Brian,
>
> Happy to give you some details:
> So, from a matrix A (user x item) that holds user-item interactions,
> this algorithm first computes a matrix S (item x item) of item
> similarities and afterwards uses these item similarities to compute
> recommendations for users.
>
> The parameters refer to the following:
>
> 'maxPrefsPerUserInItemSimilarity': the maximum number of interactions per
> user to take into account when computing S (i.e. the maximum number of
> entries to look at per row in A, selected at random). Single power-users
> with an anomalous number of interactions can heavily increase the
> computation time, without contributing to the actual quality of the
> output. Setting this to something like 500 should give you reasonable
> performance and results.
>
> 'maxSimilaritiesPerItem': this number determines the maximum number of
> similar items to look at per item (i.e. the maximum number of entries
> per row in S). Research papers reported good results with something
> between 20 and 100.
>
> 'maxPrefsPerUser': this number determines how many interactions per user
> to take into account in the final recommendation phase. This thing is
> probably bugged and should be set to a very high number (as large as the
> maximum number of interactions per user or larger); otherwise you might
> see items in the recommendations that the user already knows.
>
> In general, the only way to get a picture of the quality of a
> recommender is by doing tests in a live system with real users. You can
> of course do some hold-out tests or cross-validation offline, but good
> performance there does not necessarily correlate with good performance
> in a real system.
>
> I suggest you start by using the default values. Do you use trunk or 0.8?
>
> Best,
> Sebastian
>
> 2013/9/11 Brian Arnold
>
> > Hi,
> >
> > I am currently trying to run the distributed Item Based Collaborative
> > filtering algorithm on our Hadoop cluster, and I have a few questions
> > regarding tweaking the various properties of the algorithm. For the
> > maxPrefsPerUser, maxSimilaritiesPerItem, and maxPrefsPerUserInItemSimilarity
> > properties, I was wondering if I could get a more detailed explanation of
> > what these properties control. I saw the description in the code, but I am
> > just wondering how changing these values will affect the results of the
> > algorithm, and whether increasing them will result in better recommendations.
> >
> > Thanks
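
To make the roles of the three parameters in this thread a bit more concrete, here is a toy, single-machine sketch in Python. It is not Mahout's implementation: the function names, the plain co-occurrence count used as the similarity, and the way the per-user cap is applied in the recommendation phase are all assumptions made for illustration. The sketch always filters known items against the user's full history, so it does not show the "already known items" symptom described above.

```python
import random
from collections import defaultdict
from itertools import combinations


def item_similarities(prefs, max_prefs_in_similarity, max_similarities_per_item, seed=0):
    """Build a truncated item-item table S from a dict of user -> set of items.

    Hypothetical helper: similarity here is a raw co-occurrence count.
    """
    rng = random.Random(seed)
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        items = sorted(items)
        # Role of maxPrefsPerUserInItemSimilarity: down-sample power users at random.
        if len(items) > max_prefs_in_similarity:
            items = rng.sample(items, max_prefs_in_similarity)
        for a, b in combinations(items, 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    similarities = {}
    for item, row in cooccurrence.items():
        # Role of maxSimilaritiesPerItem: keep only the top-k entries per row of S.
        top = sorted(row.items(), key=lambda kv: (-kv[1], kv[0]))
        similarities[item] = dict(top[:max_similarities_per_item])
    return similarities


def recommend(user_items, similarities, max_prefs_per_user, how_many=3):
    """Score unseen items using at most max_prefs_per_user of the user's items."""
    known = set(user_items)
    # Role of maxPrefsPerUser: cap how many of the user's items feed the scoring.
    used = sorted(known)[:max_prefs_per_user]
    scores = defaultdict(float)
    for item in used:
        for other, sim in similarities.get(item, {}).items():
            if other not in known:  # filter against the full history
                scores[other] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]


# Tiny made-up data set, for illustration only.
prefs = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c", "d"},
    "u4": {"a", "d"},
}
S = item_similarities(prefs, max_prefs_in_similarity=10, max_similarities_per_item=2)
print(recommend(prefs["u4"], S, max_prefs_per_user=10))
```

In this sketch the two caps act at different stages: the first limits how many entries per user (row of A) contribute to S, while the last limits how many of a user's items are used when scoring candidates, which is the distinction Brian's question is about.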