[ https://issues.apache.org/jira/browse/FLINK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=15670658#comment15670658 ]
ASF GitHub Bot commented on FLINK-4961:

Github user gaborhermann commented on the issue:
https://github.com/apache/flink/pull/2819
There are some open questions:
1. Should we optimize the 3-way join? For now the join order is hard-coded; we might
also be able to give hints for join strategies.
2. How should we handle empty blocks? When matching a rating block with the current factor
blocks, there might be no rating block or no factor blocks with that ID, because the rating
block corresponds to a different user and item block at every iteration. For now we join the
blocks with a `coGroup` and essentially do a full outer join, because we need to change
the rating block ID for every factor block at each iteration. This might not be the best
solution (see the comments at `coGroup`), but I don't see a better one right now.
3. The number of blocks also determines the number of iterations, so a higher number
of blocks degrades performance. We conducted experiments on a cluster that show this:
see [plot for movielens data](https://s18.postimg.org/txap3x9o9/movielens_blocks.png)
and [for lfm_1b data](https://s11.postimg.org/ysnonuer7/lfm1b_blocks.png). Based on this we
would recommend setting the number of blocks to the smallest value that fits into memory
(and at least the parallelism of the execution). There might be some way to avoid this and
break the computation into more blocks while doing the same number of iterations, but it's not
trivial because of the possibly conflicting user-item blocks (which is why the paper uses this
blocking in the first place). Should we investigate this further? With the recommended settings
(and given enough memory) the algorithm performs well (see the plots).
4. The test data is made by hand to ensure that changes to the code do not change the
algorithm. The algorithm produces good results on real data. The question is whether we should
build a more thorough testing mechanism for matrix factorization (as proposed in the [PR for
iALS](https://github.com/apache/flink/pull/2542)) or whether this kind of testing is sufficient?
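
The full-outer-join behaviour described in point 2 can be sketched without Flink. The
following Python sketch only models the `coGroup` semantics; all names (`next_block_id`,
the dict-based grouping) are made up for illustration and are not the PR's code:

```python
def next_block_id(block_id, num_blocks):
    # Hypothetical rotation of the rating-block ID between iterations.
    return (block_id + 1) % num_blocks

def co_group_match(rating_blocks, factor_blocks, num_blocks):
    """rating_blocks: {id: ratings}; factor_blocks: {id: [factor block, ...]}."""
    # Take the union of keys from both sides: this is the full-outer-join part.
    all_ids = set(rating_blocks) | set(factor_blocks)
    out = []
    for block_id in sorted(all_ids):
        factors = factor_blocks.get(block_id, [])
        ratings = rating_blocks.get(block_id)
        # Even when there is no rating block for this ID (ratings is None),
        # the factor blocks must still be re-emitted with an updated block
        # ID -- this is why a plain inner join is not enough.
        for f in factors:
            # A real implementation would run the local SGD updates here
            # when ratings is not None; we just forward the factors.
            out.append((next_block_id(block_id, num_blocks), f))
    return out
```

With an inner join, a factor block whose current rating block does not exist would be
dropped and lost for all later iterations; the union over keys keeps it flowing.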
> SGD for Matrix Factorization
> 
>
> Key: FLINK-4961
> URL: https://issues.apache.org/jira/browse/FLINK-4961
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Gábor Hermann
> Assignee: Gábor Hermann
>
> We have started an implementation of distributed stochastic gradient descent for matrix
factorization based on Gemulla et al. [1].
> The main problem with distributed SGD in general is the conflicting updates of the model
variable. In case of matrix factorization we can avoid conflicting updates by carefully deciding
in each iteration step which blocks of the rating matrix we should use to update the corresponding
blocks of the user and item matrices (see Figure 1. in paper).
> Although a general SGD solver might seem relevant for this issue, we can do much better
in the special case of matrix factorization. E.g. in the case of a linear regression model,
the model is broadcast in every iteration. As the model is typically small in that case, we
can only avoid conflicts by having a "global" model. Based on this, the general SGD solver
is a different issue.
> To give more details, the algorithm works as follows.
> We randomly create user and item vectors, then randomly partition them into {{k}} user
and {{k}} item blocks. Based on these factor blocks we partition the rating matrix to {{k
* k}} blocks correspondingly.
> In one iteration step we choose {{k}} non-conflicting rating blocks, i.e. we should not
choose two rating blocks simultaneously with the same user or item block. This is done by
assigning a rating block ID to every user and item block. We match the user, item, and rating
blocks by the current rating block ID, and update the user and item factors by the ratings
locally. We also update the rating block ID for the factor blocks, thus in the next iteration
we use other rating blocks to update the factors.
> In {{k}} iteration steps we sweep through the whole rating matrix of {{k * k}} blocks (so instead
of {{numberOfIterationSteps}} iterations we should do {{k * numberOfIterationSteps}} iteration steps).
> [1] [http://people.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf]
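
The block-matching scheme in the quoted description can be sketched as follows. This is a
Python illustration of the strata idea from Gemulla et al.; the function names and the
`(i + step) % k` rotation are assumptions made for illustration, not taken from the PR:

```python
def stratum(k, step):
    """Rating blocks (user_block, item_block) used in one iteration step.

    User block i is paired with item block (i + step) % k, so the k rating
    blocks chosen in one step never share a user block or an item block.
    """
    return [(i, (i + step) % k) for i in range(k)]

def sweep(k):
    """All rating blocks touched in k consecutive iteration steps."""
    return {block for step in range(k) for block in stratum(k, step)}
```

Because each step uses a disjoint "diagonal" of the blocked rating matrix, the local SGD
updates never conflict, and `k` consecutive steps cover all `k * k` rating blocks, which
is why one logical pass costs `k` iteration steps.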

This message was sent by Atlassian JIRA
(v6.3.4#6332)
