flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4613) Extend ALS to handle implicit feedback datasets
Date Fri, 23 Sep 2016 14:47:20 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516644#comment-15516644 ]

ASF GitHub Bot commented on FLINK-4613:

Github user gaborhermann commented on the issue:

    We have not yet measured performance against Spark or other implementations. Such measurements
would mostly reflect the performance of the Flink ALS implementation in general, as there is not
much difference between the implicit and explicit implementations.
    Instead, we compared the implicit case with the explicit case in the Flink implementation
on the same datasets, to make sure the implicit case does not decrease the performance significantly.
(Of course, we expected the implicit case to be slower due to the extra precomputation and
broadcasting of `Xt * X`.)
    Training set size   Explicit (ms)   Implicit (ms)
    100                          8885            9196
    1000                         7879           11282
    10000                        8839            9220
    100000                       7102           10998
    1000000                      7543           10680
    The left column gives the size of the training set (I'm not sure about the unit, but
@jfeher can clarify). The other two columns give the training time in milliseconds for the
explicit and implicit cases, respectively. We ran the measurements on a small cluster
of 3 nodes.
    It seems there is a large constant overhead, since training time barely changes with the
size of the training set, but the implicit case is not significantly slower.
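    As a rough check, the relative slowdown can be computed directly from the table above.
This is only a sketch: the names `timings_ms` and `slowdown` are illustrative and are not part
of the Flink code.

```python
# Training times in milliseconds, copied from the table above:
# training set size -> (explicit, implicit)
timings_ms = {
    100: (8885, 9196),
    1000: (7879, 11282),
    10000: (8839, 9220),
    100000: (7102, 10998),
    1000000: (7543, 10680),
}


def slowdown(explicit_ms, implicit_ms):
    """Ratio of implicit to explicit training time (1.0 means equal)."""
    return implicit_ms / explicit_ms


# Slowdown ratio for every training set size in the table.
ratios = {size: slowdown(e, i) for size, (e, i) in timings_ms.items()}

for size, ratio in ratios.items():
    print(f"{size:>8}  implicit/explicit = {ratio:.2f}")
```

In these runs the ratio stays between 1.0 and 1.6 for every size, with the largest value at
100000.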
    We could do further, more thorough measurements if needed, but that might belong in a
separate issue. More benchmarking, and optimizing both the original ALS algorithm and the
specific `Xt * X` computation in the implicit case, could be a separate PR.
    What are your thoughts on this?

> Extend ALS to handle implicit feedback datasets
> -----------------------------------------------
>                 Key: FLINK-4613
>                 URL: https://issues.apache.org/jira/browse/FLINK-4613
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Gábor Hermann
>            Assignee: Gábor Hermann
> The Alternating Least Squares implementation should be extended to handle _implicit feedback_
datasets. These datasets do not contain explicit ratings by users; rather, they are built by
collecting user behavior (e.g. user listened to artist X for Y minutes), and they require
a slightly different optimization objective. See details by [Hu et al|http://dx.doi.org/10.1109/ICDM.2008.22].
> We do not need to modify much in the original ALS algorithm. See [Spark ALS implementation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala],
which could be a basis for this extension. Only the factor update step is modified, and
most of the changes are in the local parts of the algorithm (i.e. UDFs). In fact, the only
modification that is not local is precomputing the matrix product Y^T * Y and broadcasting
it to all the nodes, which we can do with broadcast DataSets.
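
The update described above can be sketched outside Flink as follows. This is a minimal,
single-machine illustration of the user-factor update from Hu et al., not the code in the
pull request: all function names, `alpha` (confidence scaling) and `lam` (regularization)
are assumptions, and plain `lam * I` regularization is used rather than whatever weighting
the Flink implementation applies. `yty` plays the role of the precomputed Y^T * Y that would
be broadcast (written `Xt * X` in the comment above for the opposite side of the alternation).

```python
def gram(factors):
    """Compute Y^T * Y for a list of factor vectors (the rows of Y)."""
    k = len(factors[0])
    g = [[0.0] * k for _ in range(k)]
    for y in factors:
        for a in range(k):
            for b in range(k):
                g[a][b] += y[a] * y[b]
    return g


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def update_user(yty, item_factors, observed, alpha, lam):
    """One implicit-feedback update of a single user's factor vector.

    yty      -- precomputed Y^T * Y, shared by all users
    observed -- dict: item index -> raw observation r_ui > 0
    Uses confidence c_ui = 1 + alpha * r_ui and preference p_ui = 1
    for observed items (p_ui = 0, c_ui = 1 for all other items).
    """
    k = len(yty)
    a = [row[:] for row in yty]  # start from the shared Gram matrix
    b = [0.0] * k
    for i, r in observed.items():  # only this user's observed items are touched
        y = item_factors[i]
        c = 1.0 + alpha * r
        for p in range(k):
            b[p] += c * y[p]
            for q in range(k):
                a[p][q] += (c - 1.0) * y[p] * y[q]
    for p in range(k):
        a[p][p] += lam
    return solve(a, b)


# Tiny made-up example: 3 items with 2 latent factors each.
item_factors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
yty = gram(item_factors)
x_u = update_user(yty, item_factors, {0: 3.0, 2: 1.0}, alpha=40.0, lam=0.1)
```

The point of the precomputation is that each user update only iterates over that user's
observed items; the contribution of all other items is already contained in `yty`.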

This message was sent by Atlassian JIRA
