mahout-user mailing list archives

From First Qaxy <>
Subject Re: Algorithm scalability
Date Wed, 05 May 2010 10:38:46 GMT
This is the most extreme case. A large auto-parts store targeting mainly auto mechanics will
have data showing different distribution patterns over the years than, let's say, a t-shirt store.
At this point I'm estimating/speculating based on the limited dataset I've acquired so
far. If you or anyone else has or knows of better statistics (variation, extremes), that would
be extremely helpful.
With regard to the use of the in-memory algorithms - I was under the impression that those
would not work on this model. Is there a rule of thumb that connects the model characteristics
to the resources needed to run an in-memory algorithm? In this case I assume that the 10 million
significant occurrences come from a much larger item-to-item matrix after applying
a min_support threshold or similar. Is the size of this item-to-item matrix what determines
the memory requirements for the algorithm? Also, is memory
needed to process the full item-to-item matrix or only the final one with the threshold applied?
If I had 1 billion items in the matrix, what would the algorithm's memory footprint be? 20 GB?
Again, if there are best practices available that link the characteristics of a model with an
algorithm's viability, that would be extremely useful.
Currently I'm storing the full item-to-item matrix to support future incremental updates of
the model. Could this somehow be done in Mahout, or is a full run required every time?
Thanks for your time.
-qf
--- On Tue, 5/4/10, Ted Dunning <> wrote:

From: Ted Dunning <>
Subject: Re: Algorithm scalability
Received: Tuesday, May 4, 2010, 5:27 PM

This is much denser than I would expect.  You are saying that you would have
an average of 1000 transactions per user.  It is more normal to have 100
less.  If you have these smaller sizes, then in-memory algorithms on a
single (large) machine begin to be practical.

On Tue, May 4, 2010 at 2:01 PM, Sean Owen <> wrote:

> On Tue, May 4, 2010 at 9:53 PM, First Qaxy <> wrote:
> > Purely based on estimates, assuming 5 billion transactions, 5 million
> users, 100K products normally distributed are expected to create a sparse
> item to item matrix of up to 10 Million significant co-occurrences
> (significance is not globally defined but in the context of the active item
> to recommend from; in other words support can be really tiny, confidence
> less so).
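
The density and memory questions in this thread come down to simple arithmetic on the figures quoted above. A minimal sketch in Python follows; the 16 bytes per stored entry (two 4-byte item ids plus one 8-byte value) is an assumption for illustration only, not a figure from the thread and not a description of how Mahout stores data, and the function names are hypothetical.

```python
# Figures quoted in the thread.
TRANSACTIONS = 5_000_000_000
USERS = 5_000_000
SIGNIFICANT_COOCCURRENCES = 10_000_000

# Assumption (not from the thread): each stored entry costs 16 bytes,
# i.e. two 4-byte item ids plus one 8-byte value, with no container overhead.
ASSUMED_BYTES_PER_ENTRY = 16


def transactions_per_user(transactions: int, users: int) -> float:
    """Average number of transactions per user."""
    return transactions / users


def sparse_entries_bytes(entries: int,
                         bytes_per_entry: int = ASSUMED_BYTES_PER_ENTRY) -> int:
    """Raw bytes needed to hold `entries` sparse matrix entries."""
    return entries * bytes_per_entry


if __name__ == "__main__":
    print("transactions per user:", transactions_per_user(TRANSACTIONS, USERS))
    print("10 million entries, MB:",
          sparse_entries_bytes(SIGNIFICANT_COOCCURRENCES) / 1e6)
    print("1 billion entries, GB:",
          sparse_entries_bytes(1_000_000_000) / 1e9)
```

Under that assumption, 5 billion transactions over 5 million users gives the 1000 transactions per user mentioned in the reply, the 10 million thresholded entries need about 160 MB, and 1 billion entries need about 16 GB. Real in-memory structures would likely need more than this because of indexing and per-object overhead, so treat these as lower bounds rather than a sizing rule.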
