mahout-user mailing list archives

From Sean Owen <>
Subject Re: Algorithm scalability
Date Tue, 04 May 2010 21:01:17 GMT
On Tue, May 4, 2010 at 9:53 PM, First Qaxy <> wrote:
> Purely based on estimates, assuming 5 billion transactions, 5 million
> users, 100K products normally distributed are expected to create a
> sparse item to item matrix of up to 10 Million significant
> co-occurrences (significance is not globally defined but in the
> context of the active item to recommend from; in other words support
> can be really tiny, confidence less so).

Sounds like a pretty solid size of a data set. I think the recommender
will work fine on this -- well, I suppose it depends on your
expectations, but this whole piece has been completely revised recently
and I feel that it's tuned nicely now.

> A few questions:
> - In 0.3 there was also a
> that I cannot find in the latest trunk. Was this merged into
> RecommenderJob?
> - Is there any example or unit test for the hadoop.item.RecommenderJob?
> - is there any more documentation

This has been merged in* as
part of this complete overhaul.

> on hadoop.pseudo? I am still not clear how that is broken into chunks
> in the case of larger models and how the results are being merged
> afterwards.
> - for clustering - if I want to create a few hundred user clusters -
> is that doable on a model similar to the one described above, based
> on boolean preferences?

For this scale, I don't think you can use the pseudo-distributed
recommender. It's just too much data to fit into an individual
machine's memory. In the pseudo-distributed case nothing is broken
down into chunks, since non-distributed algorithms generally use all
the data. It's just that one non-distributed recommender is cloned N
times so you can crank out recommendations N times faster very easily.
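
To make the "cloned N times" idea concrete, here is a rough sketch in
Python rather than Mahout's Java. It is not the hadoop.pseudo code. It
assumes the users are split into N disjoint groups and that every clone
holds the same full model; the names shard_for, partition_users and
recommend_all are made up for this illustration, and the clones run one
after another here instead of on separate machines.

```python
import zlib


def shard_for(user_id, num_clones):
    # Stable hash so a given user always lands on the same clone.
    return zlib.crc32(str(user_id).encode("utf-8")) % num_clones


def partition_users(user_ids, num_clones):
    # Split the users into num_clones disjoint groups.
    shards = [[] for _ in range(num_clones)]
    for user_id in user_ids:
        shards[shard_for(user_id, num_clones)].append(user_id)
    return shards


def recommend_all(user_ids, num_clones, recommend):
    # Every "clone" calls the same recommend function, i.e. it sees the
    # whole model. Only the list of users is divided, not the data.
    results = {}
    for shard in partition_users(user_ids, num_clones):
        for user_id in shard:
            results[user_id] = recommend(user_id)
    return results
```

Because only the user list is divided, each machine still needs the
entire model in memory, which is the limit described above.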

... well since you don't have all that many items, I could imagine one
algorithm working: slope-one. You would need to use Hadoop to compute
the item-item diffs ahead of time, and prune it. But a pruned set of
item-item diffs fits in memory. You could go this way.
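
For illustration, a minimal in-memory sketch of weighted slope-one in
Python (not Mahout's implementation, and not the Hadoop job that would
be needed at this scale). build_diffs computes the average rating
difference for each item pair, and the min_count threshold is just one
assumed way to prune; predict then uses only the pruned diffs. All
function and parameter names here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_diffs(prefs, min_count=2):
    # prefs: {user: {item: rating}}
    # Returns {(a, b): (avg of rating[b] - rating[a], count)} for a < b,
    # keeping only pairs co-rated by at least min_count users.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ratings in prefs.values():
        for a, b in combinations(sorted(ratings), 2):
            sums[(a, b)] += ratings[b] - ratings[a]
            counts[(a, b)] += 1
    return {pair: (sums[pair] / n, n)
            for pair, n in counts.items() if n >= min_count}


def predict(diffs, ratings, target):
    # Weighted slope-one estimate of one user's rating for target,
    # or None if no surviving diff links target to a rated item.
    num = 0.0
    den = 0
    for item, rating in ratings.items():
        if item == target:
            continue
        if item < target:
            entry, sign = diffs.get((item, target)), 1.0
        else:
            entry, sign = diffs.get((target, item)), -1.0
        if entry is None:
            continue
        avg, n = entry
        num += (rating + sign * avg) * n
        den += n
    return num / den if den else None
```

Only one direction of each pair is stored and the sign is flipped on
lookup, which halves the memory the diffs take up.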

But I think this is the sort of situation very well suited to the
properly distributed implementation.
