spark-user mailing list archives

From Aslan Bekirov <aslanbeki...@gmail.com>
Subject Re: MLBase Test
Date Sun, 01 Dec 2013 13:51:04 GMT
Thanks a lot Evan.

Your help is really appreciated.

BR,
Aslan


On Sun, Dec 1, 2013 at 3:00 AM, Evan R. Sparks <evan.sparks@gmail.com> wrote:

> The MLI repo doesn't yet have support for collaborative filtering, though
> we've got a private branch we're working on cleaning up that will add it
> shortly. To use MLI, you need to build it with sbt/sbt assembly, and then
> make sure all workers have access to it by passing the filename of the jar
> to SparkContext when you create it.
>
> For now, your best bet is to just use the MLlib implementation of ALS
> that's in Spark today.
>
> If you have an input file where each line is of the format
> "user,song,rating", you could load your data into the appropriate input
> format like this:
>
> import org.apache.spark.mllib.recommendation.Rating
>
> val ratings = sc.textFile(ratingsFile).map { line =>
>   val fields = line.split(',')
>   Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
> }.cache()
>
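If it helps to sanity-check that split-and-convert step outside of Spark, the same logic can be exercised on a plain string. `parseRating` here is just an illustrative helper, not part of MLlib:

```scala
// Parse one "user,song,rating" line into typed fields,
// mirroring the map function used above.
def parseRating(line: String): (Int, Int, Double) = {
  val fields = line.split(',')
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}

val (user, song, rating) = parseRating("42,1001,4.5")
```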
> Then, you can train with something like:
>
> val model = ALS.train(ratings, 10, 100, 0.1)
>
> You can also take a look at the ALS code in MLlib - there's a command line
> tool that will do the same thing and save your model to a couple of files.
>
> This will train a MatrixFactorizationModel of rank 10 in 100 iterations
> with a regularization parameter of 0.1.
>
> As for how I came up with those values -
> Rank is a measure of model complexity - the model is estimated as
> essentially #rank parameters per user and #rank parameters per song. The
> larger, the more complex the model (but also, the more complex it is to
> train and the greater the chance that you're overfitting to the input
> data.) Reasonable values are anywhere from 5 to 50.
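Concretely, the number of free parameters grows linearly with rank, so doubling the rank doubles the model size. A quick sketch (the user and song counts below are made up for illustration):

```scala
// Number of free parameters in a rank-r factorization:
// one r-dimensional vector per user plus one per song.
def alsParamCount(numUsers: Int, numSongs: Int, rank: Int): Long =
  rank.toLong * (numUsers + numSongs)

val small = alsParamCount(10000, 5000, 10)  // rank 10
val large = alsParamCount(10000, 5000, 50)  // rank 50, 5x the parameters
```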
>
> Iterations is the number of passes of the ALS algorithm to run -
> eventually the model will converge to a roughly fixed point and
> additional iterations won't change it much. Reasonable values are anywhere
> from 10 to 1000, depending on the complexity of the data and the rank of
> the model. There are early-termination checks you can do when training
> these models, but those aren't currently implemented in Spark.
>
> Regularization is a tool that encourages model sparsity. Higher
> regularization encourages model parameters that are near zero to stay
> small. This is one way to combat overfitting, and often yields models that
> work better on an out-of-sample basis.
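A common way to compare settings of rank, iterations, and lambda on an out-of-sample basis is to hold out some ratings and score the model's predictions by RMSE. A plain-Scala sketch of the scoring step, with made-up (predicted, actual) pairs rather than real model output:

```scala
import scala.math.sqrt

// Root-mean-squared error over (predicted, actual) rating pairs:
// lower is better, 0 means perfect predictions on the held-out set.
def rmse(pairs: Seq[(Double, Double)]): Double =
  sqrt(pairs.map { case (p, a) => (p - a) * (p - a) }.sum / pairs.size)

val heldOut = Seq((4.0, 4.5), (3.0, 3.0), (5.0, 4.0))
val score = rmse(heldOut)
```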
>
> - Evan
>
>
>
On Sat, Nov 30, 2013 at 12:04 PM, Aslan Bekirov <aslanbekirov@gmail.com> wrote:
>
>> Hi Evan,
>>
>> Thank you very much for your quick response.
>>
>> I am using ALS to create model, here is my method
>>
>> def doCollab() {
>>
>>     val sc = new SparkContext("local[2]", "Log Query")
>>     val mc = new MLContext(sc)
>>     var pairs = mc.load("user_song_pairs", 1 to 2)
>>     val  ratings = mc.load("user_ratings", 1)
>>
>>     val als = new ALS()
>>     als.setBlocks(-1)
>>     als.setIterations(15)
>>     als.setRank(10)
>>
>>     val model = als.run(ratings)
>>
>>   }
>>
>> But first of all, MLContext could not be resolved. Am I creating the
>> context incorrectly?
>>
>> Secondly, ALS has parameters like:
>>
>>    - *rank* is the number of latent factors in our model.
>>    - *iterations* is the number of iterations to run.
>>    - *lambda* specifies the regularization parameter in ALS.
>>
>> But I could not find example values for these parameters. Can you give
>> a bit more explanation of them and some example values?
>>
>> BR,
>> Aslan
>>
>>
>>
>>
>>
On Fri, Nov 29, 2013 at 9:03 PM, Evan Sparks <evan.sparks@gmail.com> wrote:
>>
>>> Hi Aslan,
>>>
>>> You'll need to link against the spark-mllib artifact. The method we have
>>> currently for collaborative filtering is ALS.
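In a Maven build, linking against it would look something like the fragment below. The artifact name and version here are an assumption, chosen to match the Scala version and Spark release of the spark-core dependency discussed later in this thread; double-check against the published artifacts:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.9.3</artifactId>
    <version>0.8.0-incubating</version>
</dependency>
```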
>>>
>>> Documentation is available here -
>>> http://spark.incubator.apache.org/docs/latest/mllib-guide.html
>>>
>>> We're working on a more complete ALS tutorial, and will link to it from
>>> that page when it's ready.
>>>
>>> - Evan
>>>
>>> > On Nov 29, 2013, at 10:33 AM, Aslan Bekirov <aslanbekirov@gmail.com>
>>> wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I am trying to do collaborative filtering with MLbase. I am using
>>> Spark 0.8.0.
>>> >
>>> > I have some basic questions.
>>> >
>>> > 1) I am using maven and added dependency to my pom
>>> > <dependency>
>>> >     <groupId>org.apache.spark</groupId>
>>> >     <artifactId>spark-core_2.9.3</artifactId>
>>> >     <version>0.8.0-incubating</version>
>>> > </dependency>
>>> >
>>> > I could not see any MLbase-related classes in the downloaded jar, which
>>> is why I could not import the MLI libraries. Am I missing something? Do I
>>> have to add another dependency for MLI?
>>> >
>>> > 2) Does a Java API exist for MLBase?
>>> >
>>> > Thanks in advance,
>>> >
>>> > BR,
>>> > Aslan
>>>
>>
>>
>
