mahout-user mailing list archives

From Han JU <ju.han.fe...@gmail.com>
Subject Re: ALS-WR on Million Song dataset
Date Tue, 19 Mar 2013 14:31:31 GMT
Thanks Sebastian and Sean, I will dig more into the paper.
In a quick test on a small part of the data, a larger alpha (~40) seems to
give me a better result.
Do you have an idea how long ParallelALS would take on the complete 700 MB
dataset? It contains ~48 million triples. The Hadoop cluster I have at my
disposal has 5 nodes and can factorize MovieLens 10M in about 13 min.
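
A minimal sketch of the confidence formula quoted below, to show why the
best alpha depends on the scale of the observed values. The function names
and the sample play counts are made up for illustration and are not taken
from the Mahout code. The logarithmic variant is assumed here to be one of
the alternatives the paper suggests, with epsilon as a hypothetical scaling
parameter.

```python
import math


def linear_confidence(observed_value, alpha):
    """confidence = 1 + alpha * observed_value, as quoted in this thread."""
    return 1.0 + alpha * observed_value


def log_confidence(observed_value, alpha, epsilon=1.0):
    """Assumed alternative: 1 + alpha * log(1 + observed_value / epsilon).

    It grows much more slowly than the linear form for large play counts.
    """
    return 1.0 + alpha * math.log(1.0 + observed_value / epsilon)


if __name__ == "__main__":
    # Hypothetical play counts for one user, for illustration only.
    play_counts = [1, 5, 50]
    for alpha in (1, 40):
        linear = [linear_confidence(r, alpha) for r in play_counts]
        logged = [round(log_confidence(r, alpha), 2) for r in play_counts]
        print("alpha =", alpha, "linear:", linear, "log:", logged)
```

With the linear form and alpha=40, a single play already gets a confidence
of 41 and fifty plays get 2001, so whether alpha=1 or alpha=40 works better
probably depends on how large the observed values are.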


2013/3/18 Sebastian Schelter <ssc@apache.org>

> You should also be aware that the alpha parameter comes from a formula
> the authors introduce to measure the "confidence" in the observed values:
>
> confidence = 1 + alpha * observed_value
>
> You can also change that formula in the code to something you consider
> a better fit; the paper even suggests alternative variants.
>
> Best,
> Sebastian
>
>
> On 18.03.2013 18:06, Han JU wrote:
> > Thanks for the quick responses.
> >
> > Yes, it's that dataset. What I'm using is triples of "user_id song_id
> > play_times" from ~1M users. No audio features, just plain text triples.
> >
> > It seems to me that the paper about "implicit feedback" matches this
> > dataset well: no explicit ratings, only the number of times a song was
> > played.
> >
> > Thank you Sean for the alpha value. I think they use big numbers because
> > the values in their R matrix are big.
> >
> >
> > 2013/3/18 Sebastian Schelter <ssc.open@googlemail.com>
> >
> >> JU,
> >>
> >> are you referring to this dataset?
> >>
> >> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >>
> >> On 18.03.2013 17:47, Sean Owen wrote:
> >>> One word of caution: there are at least two papers on ALS, and they
> >>> define lambda differently. I think you are talking about "Collaborative
> >>> Filtering for Implicit Feedback Datasets".
> >>>
> >>> I've been working with some folks who point out that alpha=40 seems to
> >>> be too high for most data sets. After running some tests on common data
> >>> sets, alpha=1 looks much better. YMMV.
> >>>
> >>> In the end you have to evaluate these two parameters, and the # of
> >>> features, across a range to determine what's best.
> >>>
> >>> Is this data set not a bunch of audio features? I am not sure it works
> >>> for ALS, not naturally at least.
> >>>
> >>>
> >>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju.han.felix@gmail.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Has anyone tried the ParallelALS job with implicit feedback on the
> >>>> Million Song dataset? Any pointers on alpha and lambda?
> >>>>
> >>>> In the paper alpha is 40 and lambda is 150, but I don't know what the
> >>>> r values in their matrix are. They say the values are based on the time
> >>>> units that users have watched a show, so maybe they are big.
> >>>>
> >>>> Many thanks!
> >>>> --
> >>>> *JU Han*
> >>>>
> >>>> UTC   -  Université de Technologie de Compiègne
> >>>> *     **GI06 - Fouille de Données et Décisionnel*
> >>>>
> >>>> +33 0619608888
> >>>>
> >>>
> >>
> >>
> >
> >
>
>


-- 
*JU Han*

Software Engineer Intern @ KXEN Inc.
UTC   -  Université de Technologie de Compiègne
*     **GI06 - Fouille de Données et Décisionnel*

+33 0619608888
