[ https://issues.apache.org/jira/browse/FLINK4961?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=15689805#comment15689805
]
ASF GitHub Bot commented on FLINK4961:

Github user thvasilo commented on the issue:
https://github.com/apache/flink/pull/2819
Hello @gaborhermann,
I really like the idea of introducing a `MatrixFactorization` interface that we can then
use for different specialized optimization algorithms. For the question I'm afraid I can't
be of much help, I'll read the relevant paper this week and get back to you if I have any
more comments.
1. Don't know enough about joins to answer this :/
2. For this we would need to test the two solutions you have proposed and evaluate the
performance. You have listed a couple of pros/cons there, maybe you can elaborate?
3. If there is absolutely no benefit from using more blocks I don't see the need to investigate
further. We'll need to include these instructions in the docs.
4. I think test hardening should be done in a different PRs (potentially for more algorithms).
For now manual tests should suffice.
> SGD for Matrix Factorization
> 
>
> Key: FLINK4961
> URL: https://issues.apache.org/jira/browse/FLINK4961
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Gábor Hermann
> Assignee: Gábor Hermann
>
> We have started an implementation of distributed stochastic gradient descent for matrix
factorization based on Gemulla et al. [1].
> The main problem with distributed SGD in general is the conflicting updates of the model
variable. In case of matrix factorization we can avoid conflicting updates by carefully deciding
in each iteration step which blocks of the rating matrix we should use to update the corresponding
blocks of the user and item matrices (see Figure 1. in paper).
> Although a general SGD solver might seem relevant for this issue, we can do much better
in the special case of matrix factorization. E.g. in case of a linear regression model, the
model is broadcasted in every iteration. As the model is typically small in that case, we
can only avoid conflicts by having a "global" model. Based on this, the general SGD solver
is a different issue.
> To give more details, the algorithm works as follows.
> We randomly create user and item vectors, then randomly partition them into {{k}} user
and {{k}} item blocks. Based on these factor blocks we partition the rating matrix to {{k
* k}} blocks correspondingly.
> In one iteration step we choose {{k}} nonconflicting rating blocks, i.e. we should not
choose two rating blocks simultaneously with the same user or item block. This is done by
assigning a rating block ID to every user and item block. We match the user, item, and rating
blocks by the current rating block ID, and update the user and item factors by the ratings
locally. We also update the rating block ID for the factor blocks, thus in the next iteration
we use other rating blocks to update the factors.
> In {{k}} iteration we sweep through the whole rating matrix of {{k * k}} blocks (so instead
of {{numberOfIterationSteps}} iterations we should do {{k * numberOfIterationSteps}} iterations).
> [1] [http://people.mpiinf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf]

This message was sent by Atlassian JIRA
(v6.3.4#6332)
