mahout-user mailing list archives

From "Pasmanik, Paul" <>
Subject RE: mahout 1.0 on EMR with spark item-similarity
Date Tue, 07 Apr 2015 14:01:44 GMT
Thanks, Pat.
We are currently running an EMR cluster with only 1 master and 1 core node, using EMR
AMI 3.2.3, which has Hadoop 2.4.0. We are using the default Spark configuration (from the AWS
script for Spark), which I believe sets the number of instances to 2. The Spark version is 1.1.0h (

We are not in production yet; we are still experimenting.

I have a question about the choice of search engine for serving recommendations.
I know the Practical Machine Learning book and the Mahout docs talk about Solr. Do you see any
issues with using Elasticsearch or AWS CloudSearch?
Also, looking at the content-based indicator example on the intro-cooccurrence-spark Mahout page,
I see that the spark-rowsimilarity job is used to produce an itemid-to-items matrix, but then it
says to use the tags associated with purchases in the query for tags, like this:
  field: purchase; q:user's-purchase-history
  field: view; q:user's view-history
  field: tags; q:user's-tags-associated-with-purchases

So, we are not providing the actual tags in the tags field query, are we?


-----Original Message-----
From: Pat Ferrel [] 
Sent: Monday, April 06, 2015 2:33 PM
Subject: Re: mahout 1.0 on EMR with spark item-similarity

OK, this seems fine. So you used "-ma yarn-client"; I’ve verified that this works in other

BTW we are nearing a new release. It fixes one cooccurrence problem that you may run into
if you are using lots of small files to contain the initial interaction input. This happens
often when using Spark Streaming for input.

If you want to try the source on github make sure to compile with -DskipTests since there
is a failing test unrelated to the Spark code. Be aware that jar names have changed if that

Can you report the cluster version of Spark and Hadoop as well as how many nodes?


On Apr 6, 2015, at 11:19 AM, Pasmanik, Paul <> wrote:

Pat, I was not using the spark-submit script. I am using mahout spark-itemsimilarity exactly
as it is specified in

So, what I did was create a bootstrap action that installs Spark and Mahout on the EMR cluster.
Then I used the AWS Java APIs to create an EMR job step that can call a script (Amazon provides
a scriptRunner that can run any script). So I basically create a command (mahout spark-itemsimilarity
<parameters>) and pass it to the script runner, which runs it. One of the parameters is -ma,
so I pass in yarn-client.
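
The command handed to the script runner can be sketched in shell roughly as follows. This is only an illustration: the input and output paths are hypothetical placeholders, and it assumes the `-i` (input) and `-o` (output) options of `mahout spark-itemsimilarity`; only `-ma yarn-client` comes from the message itself. The function prints the command line so it can be inspected before it is passed on.

```shell
#!/bin/sh
# Hypothetical input and output locations; replace with real paths.
INPUT="hdfs:///user/hadoop/interactions"
OUTPUT="hdfs:///user/hadoop/indicators"

# Print the command line that the EMR step would pass to the script runner.
# -ma yarn-client is the master setting reported to work in this thread.
build_itemsimilarity_cmd() {
    printf 'mahout spark-itemsimilarity -i %s -o %s -ma yarn-client\n' "$1" "$2"
}

build_itemsimilarity_cmd "$INPUT" "$OUTPUT"
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; on a real cluster the printed line is what the step would execute.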

We use the AWS Java API to programmatically start the EMR cluster (triggered by a Quartz job) with whatever
parameters that job needs.
I used the instructions here:
 to install Spark as a bootstrap action. I built mahout-1.0 locally and uploaded a package
to S3. I also created a bash script to copy that package from S3 to EMR, unpack it, and remove the Mahout
0.9 version that is part of the EMR AMI. Then I used another bootstrap action to invoke that
script and install Mahout. I also had to make changes to the mahout script: I added SPARK_HOME=/home/hadoop/spark
(this is where I installed Spark on EMR) and modified CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR
to CLASSPATH=$MAHOUT_CONF_DIR to avoid including the classpath passed in by the Amazon script runner,
since it contains the path to the 2.11 version of Scala (installed on EMR by Amazon), which conflicts
with the 2.10.x version used by Spark/Mahout.
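
The two changes to the mahout script might look like the fragment below. It is a sketch, not the actual edited script: the `SPARK_HOME` value and both `CLASSPATH` lines are taken from the message, while the `MAHOUT_CONF_DIR` fallback path is a hypothetical placeholder added only so the fragment runs on its own.

```shell
# Spark install location on the EMR nodes, as given in the message.
SPARK_HOME=/home/hadoop/spark
export SPARK_HOME

# Hypothetical fallback; the real script sets MAHOUT_CONF_DIR itself.
MAHOUT_CONF_DIR="${MAHOUT_CONF_DIR:-/home/hadoop/mahout/conf}"

# Before: CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR
# After: start from a clean classpath so entries inherited from the
# invoking script (including the Scala 2.11 path) are dropped.
CLASSPATH=$MAHOUT_CONF_DIR
```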

-----Original Message-----
From: Pat Ferrel [] 
Sent: Thursday, March 26, 2015 3:49 PM
Subject: Re: mahout 1.0 on EMR with spark item-similarity

Finally getting to Yarn. Paul, were you trying to run spark-itemsimilarity with the spark-submit
script? That shouldn’t work; the job is a standalone app that does not require spark-submit
and is unlikely to work with it.

Were you able to run on Yarn? How?

On Jan 29, 2015, at 9:15 AM, Pat Ferrel <> wrote:

There are two indices (Guava HashBiMaps) that map your IDs into and out of Mahout IDs (HashBiMap<int,
string>). There is one copy of each (row/user IDs and column/item IDs) per physical machine
that all local tasks consult. They are Spark broadcast values. These will grow linearly as
the number of items and users grows and as the size of your IDs, treated as strings, grows.
The hashmaps have some overhead, but in large collections the main cost is the size of the
application IDs stored as strings; Mahout’s IDs are ints.
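
A rough back-of-the-envelope version of that estimate, as shell arithmetic. All numbers are made-up assumptions for illustration (the ID counts, the average ID length, and 2 bytes per character for JVM strings), and the result is only a lower bound because it ignores hash map and object overhead. The one point carried over from the explanation above is that the cost grows linearly with the number of IDs and with their string length.

```shell
#!/bin/sh
# All values below are illustrative assumptions, not measurements.
USERS=10000000      # number of distinct user (row) IDs
ITEMS=1000000       # number of distinct item (column) IDs
AVG_ID_CHARS=20     # average length of an application ID string
BYTES_PER_CHAR=2    # assumes JVM strings are stored as UTF-16

# Lower bound on string payload: (users + items) * chars * bytes per char.
# Hash map and per-object overhead are deliberately left out.
estimate_id_string_bytes() {
    echo $(( ($1 + $2) * $3 * $BYTES_PER_CHAR ))
}

TOTAL=$(estimate_id_string_bytes "$USERS" "$ITEMS" "$AVG_ID_CHARS")
echo "ID strings: at least $(( TOTAL / 1024 / 1024 )) MiB per machine"
```

Doubling either the number of IDs or their average length doubles the estimate, which is the linear growth described above.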

On Jan 22, 2015, at 8:04 AM, Pasmanik, Paul <> wrote:

I was able to get Spark and Mahout installed on an EMR cluster as bootstrap actions and was able
to run the spark-itemsimilarity job via an EMR step with some modifications to the mahout script (defining
SPARK_HOME and making sure CLASSPATH is not picked up from the invoking script, which is Amazon's

I was only able to run this job using yarn-client (yarn-master is not able to submit to resource

In yarn-client mode the driver program runs in the client process and submits jobs to executors
via yarn manager, so my question is how much memory does this driver need?
Will the memory requirement vary based on the size of the input to spark-itemsimilarity?


-----Original Message-----
From: Pasmanik, Paul [] 
Sent: Thursday, January 15, 2015 12:46 PM
Subject: mahout 1.0 on EMR with spark

Has anyone tried running mahout 1.0 on EMR with Spark?
I've used instructions at
to get EMR cluster running spark.   I am now able to deploy EMR cluster with Spark using AWS
EMR allows running a custom script as a bootstrap action, which I can use to install Mahout.
What I am trying to figure out is whether I would need to build Mahout every time I start
an EMR cluster, or whether I can have pre-built artifacts and develop a script similar to the one awslabs uses
to install Spark.


