mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Realtime update of similarity matrices
Date Mon, 22 Jun 2015 00:24:20 GMT
It is possible to do but not implemented anywhere, afaik. Streaming and online/incremental model
calcs are different things. Plain streaming recalcs the model over a moving time window, just
very often; online/incremental treats the model as a mutable thing and modifies it in place.
As you can imagine, they require very different methods. Ted’s reference points out that
the internal LLR-weighted cooccurrence calc can be done online because there is a cutoff on
the number of cooccurrences kept, which means many new interactions will not affect the model
at all, and because LLR is a very simple calc that does not involve the entire row or column
vectors, only their non-zero element counts, which are easy to keep in memory (one vector each).
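To make that concrete, here is a small sketch in Scala of the LLR weight computed from just
four counts, which is why it is cheap to maintain online. It mirrors roughly what Mahout’s
LogLikelihood.logLikelihoodRatio does; treat it as illustrative rather than the exact
production code:

object Llr {
  // x * log(x), with the 0 * log(0) = 0 convention.
  private def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x.toDouble)

  // Shannon-style entropy term over a set of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // k11 = A and B together, k12 = A without B, k21 = B without A, k22 = neither.
  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    // Guard against tiny negative values from floating point noise.
    math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
  }
}

For example, two items seen together 50 times, individually 1000 and 800 times, out of 100,000
total interactions, would use k11 = 50, k12 = 950, k21 = 750, k22 = 98,250. No row or column
vectors are needed, only those counts.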

It’s relatively simple to set up Mahout’s item and row similarity to take streams and
recalc at rapid intervals. I’ve done this with Kafka feeding Spark Streaming input. This uses
an entire time window’s worth of data and so is not incremental, but since the calc is fast
and O(n), it can be scaled with the size of the Spark cluster. The cooccurrence and
cross-cooccurrence calc can be done on the public epinions data on my laptop in 12 minutes,
and that is a smallish dataset.
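A rough sketch of that setup, assuming a DStream of (user, item) pairs already coming off
Kafka. The IndexedDatasetSpark helper and the exact SimilarityAnalysis signature are from
memory and may differ across Mahout versions, so treat this as illustrative only:

import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

def scheduleRecalc(events: DStream[(String, String)]): Unit = {
  // Recompute over the last 60 minutes of interactions, sliding every 5 minutes.
  events.window(Minutes(60), Minutes(5)).foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      implicit val sc = rdd.sparkContext
      // Turn the (user, item) pairs into a Mahout IndexedDataset
      // (helper name assumed; building this may differ in your version).
      val interactions = IndexedDatasetSpark(rdd)
      // LLR-weighted cooccurrence over the whole window; the result is the
      // item-item indicator matrix to push to the serving layer.
      val indicators = SimilarityAnalysis.cooccurrencesIDSs(Array(interactions))
    }
  }
}

The whole window is recalculated each slide, so nothing is incremental, but the slide interval
can be made quite short.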

But may I ask why you want online/incremental? There are only a few edge cases that benefit
from this, and as Ted points out, there may be very few interactions that will modify the model
at all.

The reasons to update a model are:
1) New items are added. Actually, only when new items have some number of interactions. How
often is your item collection changing? If you have a very popular newspaper site and the items
change by the minute, this might be a case where very rapid model updates would benefit you.
2) The characteristics of interactions change very rapidly, i.e. users are changing their
preferences very often. I have never personally run into this case but imagine there are
examples in social media.

The Multimodal recommender can handle new users that have some usage history but were not
used in the model calc, so new users are not a case where you need incremental model updates.
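That works because in the search-engine based recommender the model is just per-item indicator
fields in the index, and recommendations come from querying it with whatever history the user
has at request time. A made-up sketch, with hypothetical field names; the real query syntax
depends on your search engine:

def recommendationQuery(recentPurchases: Seq[String], recentViews: Seq[String]): String = {
  // Each item document carries fields like "purchase_indicators" and "view_indicators"
  // (hypothetical names) built from the cooccurrence and cross-cooccurrence matrices;
  // the user's recent history is simply OR'd against those fields.
  def quote(ids: Seq[String]) = ids.map(id => "\"" + id + "\"").mkString(", ")
  s"""{
     |  "query": { "bool": { "should": [
     |    { "terms": { "purchase_indicators": [ ${quote(recentPurchases)} ] } },
     |    { "terms": { "view_indicators": [ ${quote(recentViews)} ] } }
     |  ] } }
     |}""".stripMargin
}

A brand new user with even a few views gets a non-empty query, and therefore recommendations,
without touching the model.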


On Jun 19, 2015, at 3:46 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

The standard approach is to re-run the off-line learning.

It is possible, though not yet supported in Mahout tools, to do real-time
updates.

See here for some details:
https://www.mapr.com/resources/videos/fully-real-time-recommendation-%E2%80%93-ted-dunning-sf-data-mining



On Fri, Jun 19, 2015 at 2:35 AM, James Donnelly <jamesjdonnelly@gmail.com>
wrote:

> Hi,
> 
> First of all, a big thanks to Ted and Pat, and all the authors and
> developers around Mahout.
> 
> I'm putting together an eCommerce recommendation framework, and have a
> couple of questions from using the latest tools in Mahout 1.0.
> 
> I've seen it hinted by Pat that real-time updates (incremental learning)
> are made possible with the latest Mahout tools here:
> 
> 
> http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/
> 
> But once I have gone through the first phase of data processing, I'm not
> clear on the basic direction for maintaining the generated data, e.g. with
> added products and incremental user behaviour data.
> 
> The only way I can see is to update my input data, then re-run the entire
> process of generating the similarity matrices using the itemSimilarity and
> rowSimilarity jobs. Is there a better way?
> 
> James
> 

