From gaborhermann <...@git.apache.org>
Subject [GitHub] flink issue #2819: [FLINK-4961] [ml] SGD for Matrix Factorization (WIP)
Date Wed, 16 Nov 2016 15:11:47 GMT
Github user gaborhermann commented on the issue:

    There are some open questions:
    1. Should we optimize the 3-way join? For now the join order is hard-coded, and
we might also be able to give hints for join strategies.
    2. How should we handle empty blocks? When matching a rating block with the current factor
blocks, there might be no rating block or no factor blocks with that ID, as the rating block
corresponds to a different user and item block at every iteration. For now we join
the blocks with a `coGroup`, which is basically a full outer join, because we need to change
the rating block ID for every factor block at each iteration. This might not be the best
solution (see comments at `coGroup`), but I don't see a better one right now.
    3. The number of blocks also determines the number of iterations, so a higher
number of blocks degrades performance. We conducted experiments on a cluster that show this:
see the [plot for movielens data](https://s18.postimg.org/txap3x9o9/movielens_blocks.png)
and [for lfm_1b data](https://s11.postimg.org/ysnonuer7/lfm1b_blocks.png). Based on this we
would recommend setting the number of blocks to the smallest value for which the blocks still
fit into memory (and at least the parallelism of the execution). There might be some way to avoid
this and break the computation into more blocks while doing the same number of iterations, but it's
not trivial because of the possibly conflicting user-item blocks (which is why the paper uses this
blocking in the first place). Should we investigate this further? With the recommended settings
(and given enough memory) the algorithm performs well (see the plots).
    4. The test data is made by hand to ensure that changes to the code do not change the
algorithm. The algorithm produces good results on real data. The question is whether we should
build a more thorough testing mechanism for matrix factorization (as proposed in the [PR for
iALS](https://github.com/apache/flink/pull/2542)), or whether this kind of testing is sufficient.
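
    To make points 2 and 3 above more concrete, here is a minimal sketch in plain Python (not
Flink code, and not the implementation in this PR). It assumes users and items are assigned to
blocks by a simple modulo, and that sub-step `s` pairs user block `u` with item block
`(u + s) % num_blocks`. Both are assumptions made for illustration, and the names `stratum`,
`block_ratings` and `sweep` are hypothetical. Under these assumptions, one pass over all ratings
takes `num_blocks` sub-steps, and some of the matched rating blocks can be empty.

```python
from collections import defaultdict


def stratum(step, num_blocks):
    """Return the (user_block, item_block) pairs handled in one sub-step.

    No two pairs share a user block or an item block, so their factor
    updates cannot conflict. The diagonal shift is an assumed schedule.
    """
    return [(u, (u + step) % num_blocks) for u in range(num_blocks)]


def block_ratings(ratings, num_blocks):
    """Group (user, item, rating) triples by (user_block, item_block).

    Modulo partitioning is an assumption made for this sketch.
    """
    blocks = defaultdict(list)
    for user, item, value in ratings:
        blocks[(user % num_blocks, item % num_blocks)].append((user, item, value))
    return blocks


def sweep(ratings, num_blocks):
    """Yield, per sub-step, each factor-block pair with its rating block.

    A missing rating block shows up as an empty list, similar in spirit
    to what a full outer join would produce.
    """
    blocks = block_ratings(ratings, num_blocks)
    for step in range(num_blocks):
        yield [(u, i, blocks.get((u, i), [])) for u, i in stratum(step, num_blocks)]


ratings = [(0, 0, 5.0), (1, 2, 3.0), (2, 1, 4.0), (3, 3, 1.0)]
plan = list(sweep(ratings, num_blocks=3))

for step, matched in enumerate(plan):
    empty = sum(1 for _, _, block in matched if not block)
    print(f"sub-step {step}: {len(matched)} block pairs, {empty} empty")
```

    With more blocks, each sub-step handles smaller blocks, but a full pass needs more sub-steps,
and sparse data makes empty rating blocks more likely. This is only meant to show the shape of
the trade-off, not to measure it.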

