mahout-user mailing list archives

From Manuel Blechschmidt <Manuel.Blechschm...@gmx.de>
Subject Re: Recommender Streaming with EMR
Date Mon, 25 Nov 2013 16:23:06 GMT
Hi Bryan,

On 25.11.2013, at 17:14, Bryan Marble wrote:

> Hello - 
> 
> If this isn't the best forum to ask, please let me know.

This is the correct forum to ask this question.

> 
> TL;DR;
> Is there a way to stream preference/user data to an EMR recommender workflow without
> having to go through the pain of re-uploading all preference data, and starting brand new
> jobs over and over, etc?

No, not at the moment. Streaming machine learning is still an area of active research. For now
you always train your model on all the data you have and then use it for a while. After some
time you retrain.

> 
> I am trying to process large volumes of preference data using Amazon EMR. It seems extremely
> unscalable to upload our entire preference set every time we run a job

Why? Sending 1 TB to EMR will take about 3.7 hours according to the following blog post:
http://www.rightscale.com/blog/cloud-industry-insights/network-performance-within-amazon-ec2-and-amazon-s3

If you compress the data, you can transfer roughly 10 times as much in the same time.
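
As a rough check of these numbers, here is a small sketch of the arithmetic. It assumes 1 TB means 10^12 bytes and treats the factor of 10 for compression as an estimate, not a measurement; the function names are made up for illustration.

```python
def implied_throughput_mb_s(size_bytes, hours):
    """Sustained throughput (MB/s) implied by moving size_bytes in the given hours."""
    return size_bytes / (hours * 3600) / 1e6


def transfer_hours(raw_bytes, throughput_mb_s, compression_ratio=1.0):
    """Hours needed to send raw_bytes once they are shrunk by compression_ratio."""
    bytes_on_wire = raw_bytes / compression_ratio
    return bytes_on_wire / (throughput_mb_s * 1e6) / 3600


ONE_TB = 1e12  # assumption: decimal terabyte

rate = implied_throughput_mb_s(ONE_TB, 3.7)
print(f"implied throughput: {rate:.1f} MB/s")
print(f"1 TB uncompressed:  {transfer_hours(ONE_TB, rate):.2f} h")
print(f"1 TB at 10:1:       {transfer_hours(ONE_TB, rate, 10):.2f} h")
```

So 1 TB in 3.7 hours corresponds to about 75 MB/s, and at an assumed 10:1 compression ratio the same data would need about 0.37 hours, or roughly 22 minutes.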

> , as the vast majority of the preferences will never change.

Then just append the new ones.

> It seems like the append files that Mahout can process would be perfect for this, but
> it doesn't appear that EMR supports it.

The ItemSimilarityJob can already read multiple files:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html
--input (path): Directory containing one or more text files with the preference data
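
Since the job reads every file in the input directory, appending can be done by adding a new file per batch instead of rewriting the old ones. Here is a minimal sketch of that idea using a local directory. The file naming scheme (prefs-<stamp>.csv), the helper names and the userID,itemID,preference line layout are assumptions for illustration; on EMR the directory would correspond to the job's input location, and only the newest file would need to be uploaded.

```python
import csv
import tempfile
from pathlib import Path


def write_delta(input_dir, prefs, stamp):
    """Write one batch of new preferences to its own file.

    Files that are already in input_dir are never modified.
    """
    input_dir = Path(input_dir)
    input_dir.mkdir(parents=True, exist_ok=True)
    path = input_dir / f"prefs-{stamp}.csv"  # hypothetical naming scheme
    if path.exists():
        raise FileExistsError(path)
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(prefs)  # assumed layout: userID,itemID,preference
    return path


def read_all(input_dir):
    """Read every preference file in the directory, as a directory-based job input would."""
    rows = []
    for path in sorted(Path(input_dir).glob("prefs-*.csv")):
        with path.open(newline="") as f:
            rows.extend(tuple(row) for row in csv.reader(f))
    return rows


def demo():
    input_dir = tempfile.mkdtemp()
    write_delta(input_dir, [(1, 10, 5.0), (1, 11, 3.0)], "20131124")  # initial data
    write_delta(input_dir, [(2, 10, 4.0)], "20131125")  # next day: new preferences only
    return read_all(input_dir)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Each retraining run then still processes the complete data set, but only the new file has to be transferred.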

> 
> The brute force method appears to be:
> 1) Upload preference set
> 2) Run Recommender job
> 3) Download and process results
> 4) Go to step 1
> 
> Does anyone have some general advice for processing recommendations in as real-time a
> manner as possible using EMR?

For better advice you can contact companies like Cloudera, MapR or Apaxo (my company).

> 
> Thank you for any help or references you could provide.
> 
> Bryan Marble
> 

/Manuel

-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B

