mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pat Ferrel <>
Subject Re: Question about Mahout spark-itemsimilarity
Date Wed, 10 Dec 2014 19:02:52 GMT
The idea is to put each of the indicators in a doc or db row with a field for each indicator.

If you index docs as csv files (not DB columns) with a row per doc you’ll have the following
for [AtA]

doc-ID,item1-ID item2-ID item10-ID  and so on

The doc-ID field contains the item-ID and the indicator field contains a space delimited list
of similar items

You add a field for each of the cross-cooccurrence indicators, you calculate [AtA] with every
run if the job, but [AtB] will be with a different action and so the IDs for A and B  don’t
have to be the same. For instance if you have a user’s location saved periodically you could
treat the action of “being at a location” as the input to B. You run A and B into the
job and get back two text csv sparse matrices. Combine them and index the following:

combined csv:
item1000-ID,item1-ID item2-ID item10-ID ...,location1-ID location2-ID …

This has two indicators in two fields created by joining  [AtA] and  [AtB] files. The item/doc-ID
for the row is from A always—if you look at the output that much should be be clear. Set
up Elasticsearch to index the docs with two fields.

Then the query uses the history of the user that you want to make recommendations for.

Don’t know Elasticsearch query format but the idea is to use the user’s history of items
interacted with and locations as the query.

Q: field1:”item1-ID item2-ID item3-ID” field2:”location1-ID locationN-ID”
field1 contains the data from the primary action indicator and the query string is the user’s
history of the primary action. Field2 contains the location cross-cooccurrence indicators
and so will be location-IDs, the query will be the user’s history of locations. The query
will result in an ordered list of items to recommend. So two fields, two different histories
for the same user. There will be one combined two field query. 

The items IDs may also be the same. For instance if you were recording purchases and detail-views
for an ecom site. You’d feed purchase history in as A, and detail view history in as B.

Other answers below

On Dec 10, 2014, at 7:25 AM, Ariel Wolfmann <> wrote:

Hi Pat, 

Sorry that i continue bothering you, but in the next days i need to finish my annual report,
so it would be very helpful for me if you can answer the previous email, because there are
things that i haven't clear enough.




2014-12-03 12:41 GMT-03:00 Ariel Wolfmann < <>>:
Hi Pat, 

Thank you for your answers, for now it's ok for me to run Spark local, so now i have another

Once i ran spark-itemsimilarity, i need to make the recommendation, as i read in your blog
i can do it, using a Search engine as the similarity engine, i chose ElasticSearch, but i
can't understand how this works to make the recommendations. It needs to calculate rp = hp[AtA]
where [AtA] is the spark-itemsimilarity output (the output will be that only if i set to calculate
all the similarities, not just top 100, right?)

hp[AtA] is actually what the search engine is doing. It is calculating the cosine based similarity
of your query with the indicator field you point it to. This is very loosely speaking = hp[AtA]

The number returned is defined in your config for the search engine and may even allow batched
up cursor type queries—again I haven’t used Elasticsearch. But it’s a search engine
thing, not defined by the algorithm

In most cases you will want to filter out the recommended items that were actually known by
the user. So usually you don’t want to show them things they have already purchased.

how should i need to index this output? i supose that i need to store the similarity calculated
by mahout with the itemId as the indicator, but in the example in the blog the indicator are
just the names of the most similar items. If i only pay atention just to the most similar
items, this will not calculate hp[AtA], for example if there is an item that is not in the
top similar items for the items that the user view, but have enough similarity to be a good
recommendation when you sum the similarities from this item to the viewed items..

Nope, as described above the indicator is just the most similar items (based on which users
preferred them) for the simple case. The IDs are actually strings and it’s assumed they
have no spaces in them so the search engine won’t get confused and break one ID into multiple

The number of similar items in the indicator is configurable but it’s a good start to leave
the number as the default of 500. There are very quickly diminishing returns from using more
and by “downsampling” based on strength of similarity you tend to get the “best” similar
items in that 500. Setting this too high will cause a very longer runtime and will give very
little return in quality.

The code i wrote for ElasticSearch is like this: <>

Also i read that there are some similarity modules for ElasticSearch, but is not clear for
me if this can help or not.

As you can see, i'm starting to learn about ElasticSearch or any other NoSQL search engine,
so this part of the "tutorial" is not clear enough for me.

Another question but less important:
How to do this with more than one cross-similarity, in the blog says that you can run spark-itemsimilarity
several times with the differents kind of interactions, but how should i combine this results
to make a multiple cross recommendation? 
Answered above

I'll really appreciate your answer



2014-11-14 14:23 GMT-03:00 Pat Ferrel < <>>:

This is a Spark error, not hadoop. Some temp file that spark creates is not found in the raw
file system. 

We are having a lot of problems with running the downloaded jars for spark. When you go from
local[4] (where you are running Spark as it was linked in by maven) to spark://ssh-aw:7077
<> (where you launch spark with your local machine as master and slave) you are using
two different sets of code.

Do you build Spark? What version of Hadoop are you using?

BTW I don’t understand why you are doing these things. You get no benefit. The jobs will
complete much faster if you use local[4] than if you go through psuedo-distributed Spark and

It’s true that you’ll see a similar error if you ran this on a real cluster so eventually
you’ll need to solve the problem.

Try the process described in the second FAQ entry here, where you build Spark locally: <>
The Mahout build now defaults to hadoop 2 <>

On Nov 14, 2014, at 7:02 AM, Ariel Wolfmann < <>>

Hi Pat

Thank you very much for your help, after removing spaces it work fine in standalone mode.

With Hadoop In Pseudo distributed mode, it also work fine running this command: 
./bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv
<> --output hdfs://localhost:9000/user/ariel/multiple <>  --master local[4] --filter1
purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

But trying also with Spark in Pseudo distributed mode i got an error:
The commands:
export MASTER=spark://ssh-aw:7077 <>
/bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv
<> --output hdfs://localhost:9000/user/ariel/outsis2 <>  --master spark://ssh-aw:7077
<> --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn

The error: 
INFO scheduler.DAGScheduler: Failed to run collect at TextDelimitedReaderWriter.scala:76
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4,
ssh-aw): /tmp/spark-local-20141114143728-65fb/0d/shuffle_0_0_1
(No such file or directory)

On /tmp/ related to spark i have these files:

It seems that may be i need to set up any other variables to run Spark pseudo distributed?

I just run  ~/spark-1.1.0/sbin/  

Running with Hadoop Pseudo distributed, for now is good for me, because i'm just playing with
small examples, so if you think that is a simple error tell me how to solve it, if not, this
is enough for me. Also i take your tip for run hadoop standalone is faster than pseudo distributed.

Once again, thank you very much!



2014-11-12 14:31 GMT-03:00 Pat Ferrel < <>>:
Let’s solve the empty file problem first. There’s no need to specify the inDelim, which
is [ ,\t] by default (that’s a space, comma, or tab). Also it looks like you have a space
after each comma which would make the filters not match what’s in the file. If so remove
all spaces and try again. In delimited text files every character is significant.

If there aren’t any spaces can you send me intro_multiple.csv?

On Nov 12, 2014, at 8:39 AM, Ariel Wolfmann < <>>

Hi Pat,

First of all, thank you for your quickly answer and sory for the late response, i was having
problems with the memory on my laptop, so i get access to a server (in a docker).

I've pulled the last commit on the mahout's github, and updated to Spark 1.1.0. 

Running locally with MAHOUT_LOCAL=true, and without starting any service from Hadoop and Spark
(that means running standalone mode), i still getting empty output (files  part-00000 empty
I've tried with this line:
./bin/mahout spark-itemsimilarity --input intro_multiple.csv --output multiple  --master local[4]
--filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1 --inDelim

Running in Pseudo distributed mode (starting dfs, yarn and spark all), adding hdfs://localhost:9000/
<> to the files i got:
 INFO scheduler.DAGScheduler: Failed to run collect at TextDelimitedReaderWriter.scala:76

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.4 in stage 1.0 (TID 8,
ssh-aw): ExecutorLostFailure (executor lost)
Driver stacktrace:
	at <>$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)

I've tried with this line:
 ./bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv
<> --output hdfs://localhost:9000/user/ariel/outsis <>  --master spark://ssh-aw:7077
<> --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn
1 --inDelim ,

Could be anything related to Hadoop 2.4.1 or may be the format of the input?
The input file is like:
u1, purchase, iphone
u1, purchase, ipad 

Any Idea?



2014-10-29 14:51 GMT-03:00 Pat Ferrel < <>>:
How recent a build are you using. I suggest the master branch on github but it was just bumped
to use Spark 1.1.0 so be aware.

I’d guess that there is a problem in how you are specifying your input. --input multiple_intro.csv
will look in your current directory if you have Mahout set to run locally or in your HDFS
default directory if you are running Mahout in clustered mode. Check:

> env | grep MAHOUT_LOCAL #that’s a pipe character not the letter “I"

to see if it is set. If it’s not set you are trying to get your file from HDFS. Try specifying
the full HDFS path if using Hadoop in psuedo distributed mode. 

BTW using a master of “local[4]” and setting MAHOUT_LOCAL=true (so Mahout uses local storage,
not HDFS) will run the job much faster on your laptop. You can set the number of cores used
in the “local[n]” to whatever you have. This only matters when using real data—the samples
are tiny.

On Oct 29, 2014, at 8:47 AM, Ariel Wolfmann < <>>

Hi Pat,

My name is Ariel Wolfmann, I'm from Cordoba, Argentina, and i study computer science.
I'm very interested to use  Mahout spark-itemsimilarity in my thesis, so i'm playing with
After reading carefully your posts on <>
and <>,
i'm trying with the examples, but i'm getting empty outputs when running:

* ./bin/mahout spark-itemsimilarity --input multiple_intro.csv --output outsis2  --master
spark://ariel-Satellite-L655:7077 <> --filter1 purchase --filter2 view --itemIDColumn
2 --rowIDColumn 0 --filterColumn 1

where multiple_intro.csv contains things like:
u1, purchase, iphone
u1, purchase, ipad
u2, purchase, nexus

as the tutorial of mahout suggest (the first simple example i could run succesfully, with
no empty output)

* ./bin/mahout spark-itemsimilarity --input multiple_intro2.csv --output outmulti2  --master
spark://ariel-Satellite-L655:7077 <> --filter1 like --itemIDColumn 2 --rowIDColumn 0
--filterColumn 1

where multiple_intro2.csv contains things like: 
906507914, dislike, mars_needs_moms
906507914, like, moneyball
as you suggest on your post.

In both cases, in the output folder i'm getting the files part-00000 and _SUCCESS, but part-00000
is empty..

I'm using Mahout 1.0, Spark 1.0.1 and Hadoop 2.4.1, running Hadoop as Pseudo-distributed on
my laptop.

Do you have any idea what i'm doing wrong?


Ariel Wolfmann

Ariel Wolfmann

Ariel Wolfmann

Ariel Wolfmann

Ariel Wolfmann

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message