flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4961) SGD for Matrix Factorization
Date Wed, 16 Nov 2016 15:01:10 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670629#comment-15670629
] 

ASF GitHub Bot commented on FLINK-4961:
---------------------------------------

GitHub user gaborhermann opened a pull request:

    https://github.com/apache/flink/pull/2819

    [FLINK-4961] [ml] SGD for Matrix Factorization (WIP)

    Please note that this is a work-in-progress PR, opened to discuss some design questions.
There are minor things left to do, including the documentation (the Scala docs are done).
Apart from these and the questions worth discussing, the PR is ready.
    
    Some notes:
    - Generalized the matrix factorization methods into a `MatrixFactorization` abstract
class (this slightly modifies `ALS`).
    - The algorithm can be executed in parts with `MLTools.persist`, just like `ALS`
(to use less memory).
    - The algorithm uses random block ID initialization, and also shuffles the data when
doing the updates. However, it can be made deterministic by setting a seed.
    - The objective function is simply squared loss with L2 regularization, in contrast
to the weighted-lambda-regularization of `ALS`. This could later be extended to other
regularization methods as well, since SGD is more flexible in terms of loss functions.
    - The same methods could be used for dynamically changing the learning rate as in
the `GradientDescent` implementation.
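
    For reference, the per-rating update implied by squared loss with L2 regularization
can be sketched like this (illustrative Python rather than the PR's Scala; the function
name and the `learning_rate`/`reg` parameters are placeholders for this sketch, not the
PR's API):

    ```python
    def sgd_update(user_vec, item_vec, rating, learning_rate, reg):
        """One SGD step on a single rating under squared loss with L2 regularization.

        Locally minimizes (r - u . v)^2 + reg * (|u|^2 + |v|^2) w.r.t. u and v.
        """
        pred = sum(u * v for u, v in zip(user_vec, item_vec))
        err = rating - pred
        # Gradient step for both factors; note the L2 shrinkage term.
        new_user = [u + learning_rate * (err * v - reg * u)
                    for u, v in zip(user_vec, item_vec)]
        new_item = [v + learning_rate * (err * u - reg * v)
                    for u, v in zip(user_vec, item_vec)]
        return new_user, new_item
    ```

    Repeating this step over the ratings of a block drives the squared error down; the
weighted-lambda variant used by `ALS` would instead scale `reg` by the number of ratings
per user/item.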

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gaborhermann/flink dsgd

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2819
    
----
commit 88fffbf86e7a2ac8b1adc459e01e084ab2492e07
Author: Daniel Abram <abram.daniel@hotmail.com>
Date:   2016-11-16T13:34:51Z

    [FLINK-4961] SGD for Matrix Factorization

commit 9bd6f2ea4a4fec2e7f4c64cf2b14453f3ba91e48
Author: Gábor Hermann <code@gaborhermann.com>
Date:   2016-11-16T13:35:10Z

    [FLINK-4961] SGD for Matrix Factorization test

----


> SGD for Matrix Factorization
> ----------------------------
>
>                 Key: FLINK-4961
>                 URL: https://issues.apache.org/jira/browse/FLINK-4961
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Gábor Hermann
>            Assignee: Gábor Hermann
>
> We have started an implementation of distributed stochastic gradient descent for matrix
factorization based on Gemulla et al. [1].
> The main problem with distributed SGD in general is conflicting updates to the model
> variable. In the case of matrix factorization we can avoid conflicting updates by carefully
> deciding, in each iteration step, which blocks of the rating matrix to use for updating the
> corresponding blocks of the user and item matrices (see Figure 1 in the paper).
> Although a general SGD solver might seem relevant to this issue, we can do much better
> in the special case of matrix factorization. E.g. in the case of a linear regression model,
> the model is broadcast in every iteration. As the model is typically small in that case, we
> can only avoid conflicts by having a "global" model. Based on this, a general SGD solver
> is a different issue.
> To give more details, the algorithm works as follows.
> We randomly create user and item vectors, then randomly partition them into {{k}} user
> and {{k}} item blocks. Based on these factor blocks, we correspondingly partition the
> rating matrix into {{k * k}} blocks.
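> The blocking step can be sketched as follows (illustrative Python rather than the PR's
> Scala code; the `user_block`/`item_block` ID-to-block mappings are hypothetical names
> introduced for this sketch):

```python
def partition_ratings(ratings, user_block, item_block):
    """Group ratings into k * k blocks keyed by (user block, item block).

    `ratings` is an iterable of (user_id, item_id, rating) triples;
    `user_block` / `item_block` map IDs to block indices in [0, k).
    """
    blocks = {}
    for u, i, r in ratings:
        # A rating lands in the block at the crossing of its factor blocks.
        blocks.setdefault((user_block[u], item_block[i]), []).append((u, i, r))
    return blocks
```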
> In one iteration step we choose {{k}} non-conflicting rating blocks, i.e. we must not
> simultaneously choose two rating blocks that share a user or item block. This is done by
> assigning a rating block ID to every user and item block. We match the user, item, and
> rating blocks by the current rating block ID, and update the user and item factors locally
> with the ratings. We also update the rating block ID of the factor blocks, so that in the
> next iteration step we use other rating blocks to update the factors.
> In {{k}} iteration steps we sweep through the whole rating matrix of {{k * k}} blocks (so
> instead of {{numberOfIterationSteps}} iterations we should run
> {{k * numberOfIterationSteps}} iteration steps).
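> The non-conflicting block selection above can be sketched as follows (illustrative
> Python rather than the PR's Scala; the diagonal-shift schedule below is one simple way
> to realize the rotating rating block IDs, an assumption of this sketch rather than the
> PR's exact scheme):

```python
def strata(k):
    """Yield k strata; stratum s pairs user block i with item block (i + s) % k.

    Each stratum touches every user block and every item block exactly once,
    so its k rating blocks can be processed in parallel without conflicting
    updates, and the k strata together sweep all k * k rating blocks.
    """
    for s in range(k):
        yield [(i, (i + s) % k) for i in range(k)]
```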
> [1] [http://people.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
