spark-user mailing list archives

From "Evan R. Sparks" <evan.spa...@gmail.com>
Subject Re: MLBase Test
Date Sun, 01 Dec 2013 01:00:18 GMT
The MLI repo doesn't yet have support for collaborative filtering, though
we've got a private branch we're working on cleaning up that will add it
shortly. To use MLI, you need to build it with sbt/sbt assembly, and then
make sure all workers have access to it by passing the filename of the jar
to SparkContext when you create it.
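
For example, here's a minimal sketch of that (the master URL and the jar
path are placeholders - use whatever your cluster and sbt/sbt assembly
actually give you):

import org.apache.spark.SparkContext

// Ship the MLI assembly jar to every worker by listing it in the
// SparkContext constructor (Spark 0.8 API).
val sc = new SparkContext(
  "spark://master:7077",               // placeholder master URL
  "MLI Example",                       // app name
  System.getenv("SPARK_HOME"),         // Spark install on the cluster
  Seq("target/MLI-assembly-0.1.jar"))  // placeholder path to the assembly jar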

For now, your best bet is to just use the MLlib implementation of ALS
that's in Spark today.
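
If you're building with Maven, that just means depending on the spark-mllib
artifact (shown here at the current 0.8.0-incubating version) in addition to
spark-core:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.9.3</artifactId>
    <version>0.8.0-incubating</version>
</dependency>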

If you have an input file where each line has the format
"user,song,rating", you can load your data into the form ALS expects like
this:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile(ratingsFile).map { line =>
  val fields = line.split(',')
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}.cache()


Then, you can train with something like:

val model = ALS.train(ratings, 10, 100, 0.1)

This will train a MatrixFactorizationModel of rank 10 for 100 iterations
with a regularization parameter of 0.1.

You can also take a look at the ALS code in MLlib - there's a command-line
tool that will do the same thing and save your model to a couple of files.
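
Once the model is trained, you can also score an individual user/song pair
directly - a quick sketch, with made-up IDs:

// Predicted rating for user 123 on song 456 (placeholder IDs).
val predicted: Double = model.predict(123, 456)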

As for how I came up with those values:

Rank is a measure of model complexity - the model is estimated as
essentially #rank parameters per user and #rank parameters per song. The
larger the rank, the more complex the model (but also the more expensive it
is to train, and the greater the chance that you're overfitting to the
input data). Reasonable values are anywhere from 5 to 50.

Iterations is the number of passes of the ALS algorithm to run - eventually
the model converges to a roughly fixed point, and additional iterations
won't change it much. Reasonable values are anywhere from 10 to 1000,
depending on the complexity of the data and the rank of the model. There
are early-termination checks you can do when training these things, but
they're not currently implemented in Spark.
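
You can approximate that check by hand, though: hold out some ratings,
retrain at increasing iteration counts, and watch the held-out error
flatten out. Here's a rough sketch - computeRmse is a hypothetical helper,
"training" and "heldOut" are assumed splits of your ratings, and it assumes
every held-out user and song also appears in the training set:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Hypothetical helper: RMSE over a small held-out set collected to the
// driver, since model.predict scores one (user, product) pair at a time.
def computeRmse(model: MatrixFactorizationModel, heldOut: Array[Rating]): Double = {
  val sumSq = heldOut.map { r =>
    val err = r.rating - model.predict(r.user, r.product)
    err * err
  }.sum
  math.sqrt(sumSq / heldOut.length)
}

// Retrain at a few iteration counts and watch for the error to level off.
for (iters <- Seq(10, 25, 50, 100)) {
  val model = ALS.train(training, 10, iters, 0.1)
  println("iterations=" + iters + ", held-out RMSE=" + computeRmse(model, heldOut))
}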

Regularization penalizes large parameter values - higher regularization
pushes model parameters toward zero and keeps them small. This is one way
to combat overfitting, and it often yields models that work better on an
out-of-sample basis.
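
If you'd rather pick these values empirically than by rule of thumb, a
simple grid search over rank and lambda using the same held-out RMSE works
fine - again just a sketch, reusing the hypothetical computeRmse from
above:

for (rank <- Seq(5, 10, 20, 50); lambda <- Seq(0.01, 0.1, 1.0)) {
  val model = ALS.train(training, rank, 20, lambda)
  println("rank=" + rank + ", lambda=" + lambda +
    ", held-out RMSE=" + computeRmse(model, heldOut))
}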

- Evan



On Sat, Nov 30, 2013 at 12:04 PM, Aslan Bekirov <aslanbekirov@gmail.com> wrote:

> Hi Evan,
>
> Thank you very much for your quick response.
>
> I am using ALS to create the model; here is my method:
>
> def doCollab() {
>
>     val sc = new SparkContext("local[2]", "Log Query")
>     val mc = new MLContext(sc)
>     var pairs = mc.load("user_song_pairs", 1 to 2)
>     val ratings = mc.load("user_ratings", 1)
>
>     val als = new ALS()
>     als.setBlocks(-1)
>     als.setIterations(15)
>     als.setRank(10)
>
>     val model = als.run(ratings)
>
>   }
>
> But here, first of all, MLContext could not be resolved. Am I creating the
> context wrongly?
>
> Secondly, ALS has parameters like:
>
>    - *rank* is the number of latent factors in our model.
>    - *iterations* is the number of iterations to run.
>    - *lambda* specifies the regularization parameter in ALS.
>
> But I could not find example values for these parameters. Can you give a
> bit more explanation of them and some example values?
>
> BR,
> Aslan
>
>
>
>
>
> On Fri, Nov 29, 2013 at 9:03 PM, Evan Sparks <evan.sparks@gmail.com> wrote:
>
>> Hi Aslan,
>>
>> You'll need to link against the spark-mllib artifact. The method we have
>> currently for collaborative filtering is ALS.
>>
>> Documentation is available here -
>> http://spark.incubator.apache.org/docs/latest/mllib-guide.html
>>
>> We're working on a more complete ALS tutorial, and will link to it from
>> that page when it's ready.
>>
>> - Evan
>>
>> > On Nov 29, 2013, at 10:33 AM, Aslan Bekirov <aslanbekirov@gmail.com> wrote:
>> >
>> > Hi All,
>> >
>> > I am trying to do collaborative filtering with MLbase. I am using
>> > Spark 0.8.0.
>> >
>> > I have some basic questions.
>> >
>> > 1) I am using Maven and added this dependency to my pom:
>> >
>> > <dependency>
>> >     <groupId>org.apache.spark</groupId>
>> >     <artifactId>spark-core_2.9.3</artifactId>
>> >     <version>0.8.0-incubating</version>
>> > </dependency>
>> >
>> > I could not see any MLbase-related classes in the downloaded jar, which
>> > is why I could not import the mli libraries. Am I missing something? Do
>> > I have to add some more dependencies for mli?
>> >
>> > 2) Does a Java API exist for MLBase?
>> >
>> > Thanks in advance,
>> >
>> > BR,
>> > Aslan
>>
>
>
