mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: itemsimilarity - maxPrefs parameter
Date Fri, 12 Dec 2014 16:52:57 GMT
Increase the number to Integer.MAX_VALUE, or to the larger of your number of users and number
of items. The "or" means that the rows and columns are both downsampled to that number or less.
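
As a rough illustration of what "both rows and columns are downsampled" could look like, here is a minimal Python sketch. It is not Mahout's code: the function name `downsample`, the uniform random sampling, and the order (users first, then items) are all assumptions made for the example, and Mahout's actual sampling strategy may differ.

```python
import random
from collections import defaultdict

def downsample(interactions, max_prefs, seed=0):
    """Cap the number of (user, item) pairs per user, then per item.

    Hypothetical helper for illustration only; not Mahout's implementation.
    """
    rng = random.Random(seed)

    def cap(pairs, key_index):
        # Group pairs by user (key_index=0) or by item (key_index=1).
        groups = defaultdict(list)
        for pair in pairs:
            groups[pair[key_index]].append(pair)
        kept = []
        for group in groups.values():
            # Groups over the limit are randomly sampled down to max_prefs.
            if len(group) > max_prefs:
                group = rng.sample(group, max_prefs)
            kept.extend(group)
        return kept

    by_user = cap(interactions, 0)  # limit each row (user)
    return cap(by_user, 1)          # limit each column (item)
```

With `max_prefs` at least as large as the biggest row or column, nothing is dropped, which is the effect of raising the limit as described above.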

To use all data you will also have to increase --maxSimilaritiesPerItem.

There are two matrices in the Hadoop itemsimilarity. The input is A, with one row per user
listing each item that user has interacted with. From this, AtA is calculated as the output, using
LLR instead of actual matrix multiplication. This yields an AtA with values weighted by LLR
strength. --maxSimilaritiesPerItem will further limit the values here to no more than that
number. There is also a quality threshold, which is pretty difficult to use.
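
To make the LLR weighting and the per-item limit more concrete, here is a small Python sketch. It uses the standard log-likelihood ratio (G^2) statistic on a 2x2 cooccurrence table and then keeps the strongest entries per item. The function names and the dictionary layout are assumptions for the example, not Mahout's API.

```python
import heapq
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # N times the Shannon entropy (in nats) of the given counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for items A and B.

    k11: users with both A and B   k12: users with A but not B
    k21: users with B but not A    k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

def top_similar(scores, max_similarities):
    """Keep only the strongest entries per item.

    scores: {item: {other_item: llr_score}} (hypothetical layout).
    """
    return {
        item: heapq.nlargest(max_similarities, others.items(), key=lambda kv: kv[1])
        for item, others in scores.items()
    }
```

Items that cooccur exactly as often as chance would predict score close to zero, so a cutoff on the number of similarities per item mostly discards weak entries.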

If you remove all of these downsampling params you will approach O(n^2) runtime; if you use
them you will have O(n). You will also get rapidly diminishing returns by removing downsampling.
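
One way to see where the quadratic cost comes from, assuming the work is dominated by the item pairs that cooccur within each user's row: a user with k interactions contributes k(k-1)/2 pairs, so capping k bounds the work per user by a constant. The helper names below are made up for this sketch.

```python
def item_pairs(k):
    """Number of distinct item pairs in a row with k interactions."""
    return k * (k - 1) // 2

def total_pairs(prefs_per_user, max_prefs=None):
    """Total cooccurring pairs over all users, optionally capping each row."""
    total = 0
    for k in prefs_per_user:
        if max_prefs is not None:
            k = min(k, max_prefs)
        total += item_pairs(k)
    return total
```

For example, `total_pairs([5000])` counts 12,497,500 pairs for a single user who touched every one of 5000 items, while `total_pairs([5000], max_prefs=500)` counts 124,750.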

Without downsampling, the indicator matrix will have arbitrarily many similar items of
diminishing strength, and some could be nearly useless. This potentially large vector may be
unwieldy in your other calculations and has not had low-value similar items filtered out.

Bottom line is that the downsampling can be tweaked, but removing it altogether is not likely
to be a good thing.


On Dec 12, 2014, at 6:18 AM, Gruszowska Natalia <Natalia.Gruszowska@grupaonet.pl> wrote:

Hi All, 

In the itemsimilarity method there is a parameter like:

--maxPrefs (-mppu) maxPrefs                               max number of
                                                         preferences to
                                                         consider per user or
                                                         item, users or items
                                                         with more preferences
                                                         will be sampled down
                                                         (default: 500)

How does it work exactly?
If I have 5 mln users and 5000 items and I run itemsimilarity with the default maxPrefs, does it
consider only 500 ranks from those 5 mln, or what? Is it sampling? What can I do to force calculation
for all input data?

			M1   M2   M3 .... M5000
U_1
U_2
...
U_5000000

What does "or" mean in the definition:
"max number of preferences to consider per user or item"


Thx in advance
Natalia



