mahout-user mailing list archives

From Abramov Pavel <>
Subject HA: SSVD fails on seq2sparse output.
Date Thu, 15 Nov 2012 20:09:26 GMT

1) Thank you, I'll try 0.8 instead of 0.7.

2) Regarding my problem and seq2sparse: we do not perform text analysis, we perform user click
analysis. My documents are internet users and my terms are the URLs they clicked. The input data contains
"user_id<>url1, url2, url1, urlN etc" vectors. It is really easy to convert these vectors
to sparse TFIDF vectors using seq2sparse. The frequency of URLs follows a power law. That's why
we use seq2sparse with TFIDF weighting. My goals are:
- to recommend new URLs to a user
- to reduce the user<>URL dimensionality for analysis of both users (U) and URLs (V) (clustering,
classification, etc.)
- to find the similarity between a user and a URL (dot_product{Ua[i], Va[j]})

Is SVD a suitable solution for this problem?
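
To make the third goal concrete, here is a minimal NumPy sketch on a toy matrix. It is not Mahout code and says nothing about whether SVD is the right choice at full scale. The matrix values, the choice k = 2, and the even split of singular values between Ua and Va are all assumptions made for illustration.

```python
import numpy as np

# Toy user-by-URL weight matrix (4 users x 3 URLs); values are made up.
A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
])

k = 2  # number of latent dimensions kept (hypothetical choice)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Split the singular values evenly (an assumption) so that
# Ua @ Va.T equals the rank-k approximation of A.
Ua = U[:, :k] * np.sqrt(s[:k])      # one k-dimensional row per user
Va = Vt[:k, :].T * np.sqrt(s[:k])   # one k-dimensional row per URL


def similarity(i, j):
    """Dot product between user i and URL j in the reduced space."""
    return float(Ua[i] @ Va[j])


# Rank-k reconstruction; entry (i, j) is exactly similarity(i, j).
A_k = Ua @ Va.T
```

With this scaling, similarity(i, j) is the (i, j) entry of the rank-k approximation, so it can be read as a smoothed version of the original weight.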

3) I can apply SSVD to a sample (0.1% of my data), but it fails on 100% of the data: the Bt-job
stops in the Map phase with "Java heap space" or "timeout" errors.
The input matrix is a sparse 20,000,000 x 150,000 matrix with ~0.03% non-zero values (8GB total).
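
For scale, the figures above work out to roughly 9e8 non-zero entries, or about 45 per user. A quick back-of-the-envelope check in Python, taking "8GB" as 8e9 bytes (an assumption about how the size was measured):

```python
rows = 20_000_000        # users
cols = 150_000           # URLs
density = 0.03 / 100     # ~0.03% non-zero values
total_bytes = 8 * 10**9  # "8GB total", assumed here to mean 8e9 bytes

cells = rows * cols                        # total cells in the dense matrix
nonzeros = cells * density                 # estimated non-zero entries
nnz_per_row = nonzeros / rows              # average clicked URLs per user
bytes_per_nonzero = total_bytes / nonzeros # average storage per non-zero

print(f"cells:             {cells:.3e}")
print(f"non-zeros:         {nonzeros:.3e}")
print(f"non-zeros per row: {nnz_per_row:.1f}")
print(f"bytes per nonzero: {bytes_per_nonzero:.2f}")
```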

How I use it:

mahout-distribution-0.7/bin/mahout ssvd \
-i /tmp/pabramov/sparse/tfidf-vectors/ \
-o /tmp/pabramov/ssvd \
-k 200 \
-q 1 \
--reduceTasks 150 \
--tempDir /tmp/pabramov/tmp \
-Dmapred.max.split.size=1000000

I can't get past the Bt-job. Should I decrease split.size and/or add extra parameters? Our Hadoop
cluster has 400 Map and 300 Reduce slots, with 1 CPU core and 2GB RAM per task.
The Q-job completes in 20 minutes.

Many thanks in advance!


From: Dmitriy Lyubimov []
Sent: 15 November 2012, 21:53
Subject: Re: SSVD fails on seq2sparse output.

On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel <> wrote:

> Many thanks in advance, any suggestion is highly appreciated. I don't know
> what to do. CF produces inaccurate results for my tasks; SVD is the only
> hope ))

I am also doubtful about that (if you are trying to factorize your recommendation
space). SVD has proven to be notoriously inadequate for that problem.
ALS-WR would be a much better first stab.

However, since you seem to be performing text analysis (seq2sparse), I don't
immediately see how it is related to collaborative filtering. Perhaps if
you told us more about your problem, I am sure there are people on this list
who could advise you on the best course of action.

> Regards,
> Pavel
