mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From First Qaxy <>
Subject Re: Algorithm scalability
Date Wed, 05 May 2010 14:59:39 GMT
Thanks Sean. This is the information I was looking for and I'll start digging. Yes I do understand
at this point the distributed vs non-distributed pros and cons. The distributed ones allow
you to scale well given you have enough well connected nodes and doesn't have a memory constrain.
However there's a performance price to pay. The non distributes ones are faster but bounded
by the memory available on the machine.
Out of curiosity - sorry if this have been answered before - would it be possible to combine
the two approaches so you could break the data set in batches that could fit in memory and
use a non-distributed algorithm to provide results for each batch and then use Hadoop to merge
the results in a sensible way? This would improve performance while scaling (this is different
than the pseudo approach where you simply distribute the work on the same model). I didn't
give it much though but I think this might work some limited cases. 
--- On Wed, 5/5/10, Sean Owen <> wrote:

From: Sean Owen <>
Subject: Re: Algorithm scalability
Received: Wednesday, May 5, 2010, 6:56 AM

On Wed, May 5, 2010 at 11:38 AM, First Qaxy <> wrote:
> With regards to the use of the in-memory algorithms - I was under the impression that
those would not work on this model. Is there a rule of thumb that connects the model characteristics
to the resources needed to run an in-memory algorithm? In this case I assume that 10 million
significant occurrences come from a much larger set of item-to-item matrix after applying
a min_support threshold or similar. Is this

My crude guidance is that with up to 100M data points (preferences),
you can probably get the entire data set into memory. Meaning, this is
roughly what fits in a very large but not massive heap (like 4GB).

It's not reasonable to fit a couple billion into memory unless you
really want to play with a 64GB heap or something (which you could!)

But I'm suggesting algorithms that don't need the whole data model in
memory, like slope one. All it really needs in memory to perform well
are the item-item average preference diffs. One diff per item-item
pair may be just fine to fit in memory if the number of items is
relatively not large -- and this is a situation where you can throw
away diffs without really compromising the computation much.

These take a lot of work to compute, but you can distribute that part
and recompute it periodically (there is already a MapReduce available
for this).

It still needs access to potentially the data to compute a
recommendation, so it would need to be accessible from a database or
something, not in memory. If that's feasible, this works.

>  size of the item-to-item determining the memory requirements for the algorithm? Also
is memory needed to process the full item-to-item matrix or only the final one with the threshold
applied?If I would have 1 bln items in the matrix what would the algorithm's memory footprint
be? 20Gb? Again, if there's a best practices available to link the characteristics of a model
with the algorithms viability - that would be extremely useful.

We're still talking about non-distributed possibilities here? because
in the distributed version there is no memory issue. The computation
is chopped up so much that no one piece of it needs all the data at
once. (This has its own price of course.)

In general, like my slope-one suggestion above, yes there are several
non-distributed possibilities of this form.

You can always leave all your data in, say, a database. However the
algorithms are so data-intensive that it will be unreasonably slow to
leave it at that.

You're examining item-based algorithms, it seems, of which
co-occurrence-based approaches are a particular type, and which are a
close cousin to slope-one. In those, almost all of the data access
comes from computing item-item values (co-occurrence, similarity,
average pref diff, etc.)

If you can precompute those values, prune them, and store them in
memory, then performance gets reasonable again, even without all data
in memory.

And this sort of approach is feasible when the number of items is
relatively low, or else the number of item-item pairs gets too large
and again you run out of memory.

> Currently I'm storing the full item-to-item matrix to support future incremental update
of the model. Could this somehow be done in Mahout or is a full run required every time?

(Here we're talking distributed again?)

If you're talking about a co-occurrence matrix, yeah I imagine it's
not hard to incrementally update it. There's no M/R for that but it's
easy to write. It operates on the co-occurrence matrix produces by
UserToCooccurrenceReducer, which is just a SequenceFile full of itemID
mapped to Vector of co-occurrences.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message