mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: SSVD fails on seq2sparse output.
Date Sun, 18 Nov 2012 19:31:24 GMT
ALS-WR is a great fit for this input. Your pre-processing is a good way to
add some extra info to the process.

I believe the implicitFeedback=true setting does make it follow the paper
you cite. It's no longer estimating the input (i.e. not estimating
'ratings') but using the input values as loss function weights. This works
nicely.
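
For reference, the objective from that paper, sketched in its notation
(r_ui is your raw input value; alpha scales confidence; lambda regularizes):

$$\min_{X,Y} \sum_{u,i} c_{ui}\,(p_{ui} - x_u^\top y_i)^2 + \lambda\Big(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\Big), \quad p_{ui} = \mathbb{1}[r_{ui} > 0], \; c_{ui} = 1 + \alpha\, r_{ui}$$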

Yes, as Sebastian says, it speeds things up greatly to put the feature
matrices in memory, but with 20M users that is far bigger than the memory
allocated to your reducers.
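
(As a rough check: 20,000,000 users x 21 features x 8 bytes per double is
about 3.4 GB for the user feature matrix alone, before any JVM object
overhead -- already more than a typical 2-3 GB task heap.)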

PS I think I mentioned off-list, but this is more or less exactly the basis
of Myrrix (http://myrrix.com). It should be able to handle this scale,
maybe slightly more easily since it can load only the subset of these
matrices needed by each worker -- more reducers means less RAM per reducer.
You might also try this out if scale is the issue.


On Sun, Nov 18, 2012 at 4:22 PM, Abramov Pavel <p.abramov@rambler-co.ru> wrote:

> Many thanks for your explanation about SVD and ALS-WR factor models. You
> are absolutely right: we have neither "negative feedback" data nor
> "preference" data.
>
> Unfortunately we can't use content-based algorithms ("to grab URL
> content") right now. What we have is an item title (2-10 terms), but not
> the whole item content.
> We use this data to merge different URLs (word stemming, pruning
> stop-words etc.). As a result we interpret different URLs with "similar"
> titles as a single URL. This step reduces the item count.
> The day will come when we'll combine content-based filtering and CF ))
>
>
> Can you please help me with 2 issues regarding ALS-WR:
> 1) Will the "implicitFeedback true" parameter for parallelALS enable the
> technique described in "CF for Implicit Feedback Datasets"? (thanks for the
> link to this paper btw)
> 2) Is there any detailed description of the parallelALS job? I can't run
> ALS-WR with my data: it fails during the M matrix job on the 1st iteration
> (right after the U matrix job completes). I am not sure it is a good idea,
> but I decreased the max split size to force the mappers to use less data.
> Without this parameter, the mappers of the M job fail during the
> "initializing" phase.
>
> =================================
> mahout parallelALS \
> -i /tmp/pabramov/sparse/als_input/  \
> -o /tmp/pabramov/sparse/als_output \
> --numFeatures 21 \
> --numIterations 15 \
> --lambda 0.065 \
> --tempDir /tmp/pabramov/tmpALS \
> --implicitFeedback true \
> -Dmapred.max.split.size=4000000 \
> -Dmapred.map.child.java.opts="-Xmx3024m -XX:-UseGCOverheadLimit"
> =================================
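>
> (For what it's worth, Hadoop's GenericOptionsParser typically only honors
> -D arguments that come before the job-specific options; a reordered sketch
> of the same command, assuming parallelALS is run through ToolRunner:)
>
> =================================
> mahout parallelALS \
> -Dmapred.max.split.size=4000000 \
> -Dmapred.map.child.java.opts="-Xmx3024m -XX:-UseGCOverheadLimit" \
> -i /tmp/pabramov/sparse/als_input/  \
> -o /tmp/pabramov/sparse/als_output \
> --numFeatures 21 \
> --numIterations 15 \
> --lambda 0.065 \
> --tempDir /tmp/pabramov/tmpALS \
> --implicitFeedback true
> =================================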
>
>
>
> Thanks!
>
> Pavel
>
>
>
>
>
>
> On 16.11.12 at 0:31, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> >On Thu, Nov 15, 2012 at 12:09 PM, Abramov Pavel
> ><p.abramov@rambler-co.ru> wrote:
> >
> >> Dmitriy,
> >>
> >> 1) Thank you, I'll try 0.8 instead of 0.7.
> >>
> >> 2) Regarding my problem and seq2sparse: we do not perform text analysis,
> >> we perform user click analysis. My documents are internet users and my
> >> terms are the URLs they clicked. The input data contains "user_id<>url1,
> >> url2, url1, urlN etc" vectors. It is really easy to convert these vectors
> >> to sparse TFIDF vectors using seq2sparse (minimal sketch below). The
> >> frequency of URLs follows a power law; that's why we use seq2sparse with
> >> TFIDF weighting. My goals are:
> >> - to recommend new URLs to users
> >> - to reduce the user<>URL dimensionality for both user (U) and URL (V)
> >> analysis (clustering, classification etc.)
> >> - to find the similarity between a user and a URL
> >> ( dot_product{Ua[i], Va[j]} )
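> >>
> >> For reference, a minimal seq2sparse sketch for this kind of input (the
> >> input path is hypothetical and the numeric settings are illustrative, not
> >> our production values; flag names assume the 0.7 CLI):
> >>
> >> ====================
> >> mahout seq2sparse \
> >> -i /tmp/pabramov/seq_input \
> >> -o /tmp/pabramov/sparse \
> >> --weight tfidf \
> >> --minSupport 5 \
> >> --maxDFPercent 90 \
> >> --namedVector
> >> ====================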
> >>
> >> Is SVD a suitable solution for this problem?
> >>
> >Like I said, I don't think so.
> >
> >Somebody just came around with the exact same problem the other day.
> >
> >* First off, if your data is sparse (i.e. there is no data on a user's
> >affinity for a particular URL simply because the user never knew that URL
> >existed), SVD is terrible for that, because it cannot tell whether a user
> >has not visited a URL to date because he did not know about it or because
> >he did not like it. Like I said, ALS-WR is an improvement over this, but
> >it still falls short in the sense that you would be better off encoding
> >the implicit feedback and a confidence for the factorizer. See
> >http://research.yahoo.com/pub/2433, which is now a very popular approach.
> >Ask Sebastian Schelter how we do it in Mahout.
> >
> >* Second off, your data will probably still be too sparse for good
> >inference. *I think* it would eventually help if you could grab the
> >content of the pages and map them into a topical space using LSA or LDA
> >(CVB in Mahout). Once you have content info behind the URLs, you would be
> >able to combine (boost) factorization and regression on content (e.g. you
> >could train a regression on the URL content side first to predict the
> >average user response, and then use implicit feedback factorization to
> >guess the factors of the residual based on the user). I guess there is no
> >precooked method here for it, but that would probably be the most accurate
> >thing to do. (Eventually you may also want to do some time-series EMA
> >weighting and autoregression on the result, which might yield even better
> >approximations for affinities based on the time of the training data as
> >well as the current time.)
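> >
> >(If you went that route, a minimal CVB sketch -- paths hypothetical, and
> >assuming the 0.7 cvb CLI where -k is the topic count and -x the iteration
> >cap:)
> >
> >====================
> >mahout cvb \
> >-i /path/to/url_content_term_vectors \
> >-o /path/to/topic_model \
> >-dt /path/to/doc_topic_output \
> >-k 50 \
> >-x 25
> >====================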
> >
> >
> >> 3) I can apply SSVD to a sample (0.1% of my data), but it fails with 100%
> >> of the data (the Bt-job stops in the Map phase with "Java heap space" or
> >> "timeout" errors).
> >> The input matrix is a sparse 20,000,000 x 150,000 matrix with ~0.03%
> >> non-zero values (8 GB total).
> >>
> >> How I use it:
> >>
> >> ====================
> >> mahout-distribution-0.7/bin/mahout ssvd \
> >> -i /tmp/pabramov/sparse/tfidf-vectors/ \
> >> -o /tmp/pabramov/ssvd \
> >> -k 200 \
> >> -q 1 \
> >> --reduceTasks 150 \
> >> --tempDir /tmp/pabramov/tmp \
> >> -Dmapred.max.split.size=1000000 \
> >> -ow
> >> ====================
> >>
> >> I can't get past the Bt-job... Should I decrease split.size and/or add
> >> extra params?
> >> Hadoop has 400 Map and 300 reduce slots with 1 CPU core and 2GB RAM per
> >> task.
> >> Q-job completes in 20 minutes.
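> >>
> >> One tuning sketch (assuming the 0.7 SSVD CLI exposes --blockHeight/-r and
> >> --outerProdBlockHeight/-oh to cap its in-memory buffers; the values below
> >> are guesses aimed at a 2 GB task, not tested settings):
> >>
> >> ====================
> >> mahout ssvd \
> >> -Dmapred.max.split.size=1000000 \
> >> -i /tmp/pabramov/sparse/tfidf-vectors/ \
> >> -o /tmp/pabramov/ssvd \
> >> -k 200 \
> >> -q 1 \
> >> -r 10000 \
> >> -oh 30000 \
> >> --reduceTasks 300 \
> >> --tempDir /tmp/pabramov/tmp \
> >> -ow
> >> ====================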
> >>
> >> Many thanks in advance!
> >>
> >> Pavel
> >>
> >>
> >> ________________________________________
> >> From: Dmitriy Lyubimov [dlieu.7@gmail.com]
> >> Sent: November 15, 2012, 21:53
> >> To: user@mahout.apache.org
> >> Subject: Re: SSVD fails on seq2sparse output.
> >>
> >> On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel <p.abramov@rambler-co.ru> wrote:
> >>
> >> >
> >> > Many thanks in advance, any suggestion is highly appreciated. I don't
> >> > know what to do; CF produces inaccurate results for my tasks, and SVD
> >> > is the only hope ))
> >> >
> >>
> >> I am also doubtful about that (if you are trying to factorize your
> >> recommendation space). SVD has proven to be notoriously inadequate for
> >> that problem. ALS-WR would be a much better first stab.
> >>
> >> However, since you seem to be performing text analysis (seq2sparse), I
> >> don't immediately see how it is related to collaborative filtering --
> >> if you told us more about your problem, I am sure there are people on
> >> this list who could advise you on the best course of action.
> >>
> >>
> >> > Regards,
> >> > Pavel
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
>
>
