mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: spark-itemsimilarity slower than itemsimilarity
Date Sat, 01 Oct 2016 00:12:25 GMT
Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or
some number of cores on localhost. This avoids all serialization. With times that low and
data that small, there is no benefit to using separate machines.

We are using this with ~1TB of data. Using Mahout as a library, we also set the partitioning to
4x the number of cores in the cluster with --conf spark.default.parallelism=384; on 3 very large
machines it completes in 42 minutes. We also have 11 events, so 11 different input matrices.
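As a rough rule of thumb, the partition count is 4x the total executor cores. The per-machine core count below is an assumption made only to reproduce the 384 mentioned above; the thread states just "3 very large machines":

```python
# Rule of thumb from this thread: partitions = 4 x total executor cores.
machines = 3
cores_per_machine = 32  # assumed figure, not stated in the thread
parallelism = 4 * machines * cores_per_machine
print(parallelism)  # -> 384
```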

I’m testing a way to train on only what is used in model creation: take pairRDDs as input and
filter out every interaction that is not from a user in the primary matrix. With 1TB this reduces
the data to a very manageable size and shrinks the BiMaps dramatically (they are held in memory).
The effect may be less pronounced on other data, but it will still make the computation more
efficient. With the Universal Recommender in the PredictionIO template we still use all user
input in the queries, so you lose no precision and get real-time user interactions in your
queries. If you want a more modern recommender than the old Mahout MapReduce one, I’d strongly
suggest you consider it, or at least use it as a model for building a recommender with Mahout.
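The filtering idea can be sketched in plain Python like this (a minimal sketch with made-up users and items; the real computation works on Spark pairRDDs, not in-memory lists):

```python
# Each event type is a list of (user, item) interaction pairs.
primary = [("u1", "A"), ("u2", "B")]                  # e.g. the primary event
secondary = [("u1", "C"), ("u3", "D"), ("u2", "E")]   # e.g. a secondary event

# Keep only secondary interactions from users who appear in the primary
# matrix; users absent from the primary event contribute nothing to the
# cross-occurrence model and only bloat the user BiMap.
primary_users = {user for user, _ in primary}
filtered = [(u, i) for u, i in secondary if u in primary_users]

print(filtered)  # -> [('u1', 'C'), ('u2', 'E')]
```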

On Sep 30, 2016, at 10:39 AM, Sebastian <> wrote:

Hi Arnau,

I don't think you can expect any speedup in your setup: your input data is way too small, and
I think you are running only two concurrent tasks. Maybe you should try a larger sample of your
data and more machines.

At the moment, it seems to me that the overheads of running in a distributed setting (task
scheduling, serialization...) totally dominate the computation.


On 30.09.2016 11:11, Arnau Sanchez wrote:
> Hi!
> Here you go: "ratings-clean" contains only pairs of (user, product) for those products with 4 or more user interactions (770k -> 465k):
> The results:
> 1 part of 465k:   3m41.361s
> 5 parts of 100k:  4m20.785s
> 24 parts of 20k: 10m44.375s
> 47 parts of 10k: 17m39.385s
> On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <> wrote:
>> Hi Arnau,
>> I had a look at your ratings file and it's kind of strange. It's pretty
>> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
>> of these, only 50k have more than 3 interactions.
>> So I think the first thing you should do is throw out all the items
>> with so few interactions. Item similarity computations are pretty
>> sensitive to the number of unique items; maybe that's why you don't see
>> much difference in the run times.
>> -s
>> On 29.09.2016 22:17, Arnau Sanchez wrote:
>>> --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10
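Sebastian's pre-filtering step above (keeping only items with 4 or more interactions) can be sketched in plain Python (toy data for illustration; on a real ratings file this would be a group-by over the full log):

```python
from collections import Counter

# (user, item) interaction pairs; toy data for illustration.
ratings = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"), ("u4", "A"),            # A: 4 interactions
    ("u1", "B"), ("u2", "B"),                                       # B: only 2 -> dropped
    ("u1", "C"), ("u2", "C"), ("u3", "C"), ("u4", "C"), ("u5", "C"),  # C: 5 interactions
]

# Count interactions per item, then keep only pairs whose item
# has at least 4 interactions.
counts = Counter(item for _, item in ratings)
clean = [(u, i) for u, i in ratings if counts[i] >= 4]

print(len(clean))  # -> 9 pairs survive (4 for A, 5 for C)
```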
