flink-issues mailing list archives

From "Peter Schrott (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-1731) Add kMeans clustering algorithm to machine learning library
Date Wed, 13 May 2015 07:33:59 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541516#comment-14541516 ]

Peter Schrott edited comment on FLINK-1731 at 5/13/15 7:33 AM:
---------------------------------------------------------------

Hi [~chiwanpark],

the thing is that, to fit the model, KMeans uses two datasets. One is the training data; the
other is the set of initial centroids. The initial centroids are used to create the appropriate
clusters on the training dataset. These clusters define the fitted model.

This means the {{fit}} method needs two inputs at that point. This is the reason
why I suggested using the parameter map to pass the initial centroids. The training dataset
will be passed as an argument to the {{fit}} method, as in the CoCoA implementation.

The trained model will then be applied to the test dataset.
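
To make the proposal concrete, here is a minimal plain-Python sketch of the idea. It is not the flink-ml API: the class, the `INITIAL_CENTROIDS` key, and the method signatures are hypothetical names chosen for illustration. It only shows the shape of the suggestion, with the training data passed to `fit` as an argument, the initial centroids passed through a parameter map, and the fitted model applied to other points afterwards.

```python
import math

# Hypothetical parameter key, invented for this sketch.
INITIAL_CENTROIDS = "initialCentroids"


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


class KMeans:
    def __init__(self, num_iterations=10):
        self.num_iterations = num_iterations
        self.centroids = None  # set by fit()

    def fit(self, training, parameters):
        """Fit on `training`; the initial centroids come from the parameter map."""
        centroids = [tuple(c) for c in parameters[INITIAL_CENTROIDS]]
        for _ in range(self.num_iterations):
            # Assign every training point to its nearest centroid.
            clusters = [[] for _ in centroids]
            for p in training:
                clusters[nearest(p, centroids)].append(p)
            # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
            centroids = [
                tuple(sum(xs) / len(members) for xs in zip(*members)) if members else c
                for c, members in zip(centroids, clusters)
            ]
        self.centroids = centroids
        return self

    def predict(self, points):
        """Label each point with the index of its nearest fitted centroid."""
        return [nearest(p, self.centroids) for p in points]


# Usage: training data as the argument, initial centroids via the parameter map.
training = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
model = KMeans(num_iterations=5).fit(
    training, {INITIAL_CENTROIDS: [(1.0, 1.0), (9.0, 9.0)]}
)
labels = model.predict([(0.5, 0.5), (9.5, 10.0)])
```

Keeping the centroids in the parameter map leaves `fit` with a single data argument, which is the point of the suggestion above.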


was (Author: peedeex21):
Hi [~chiwanpark],

the thing is that, to fit the model, KMeans uses two datasets. One is the training data; the
other is the set of initial centroids.

This means the {{fit}} method needs two inputs at that point. This is the reason
why I suggested using the parameter map to pass the initial centroids. The training dataset
will be passed as an argument to the {{fit}} method, as in the CoCoA implementation.



> Add kMeans clustering algorithm to machine learning library
> -----------------------------------------------------------
>
>                 Key: FLINK-1731
>                 URL: https://issues.apache.org/jira/browse/FLINK-1731
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Alexander Alexandrov
>              Labels: ML
>
> The Flink repository already contains a kMeans implementation, but it has not yet been ported
> to the machine learning library. I assume that only the data types used have to be adapted,
> after which it can be moved more or less directly to flink-ml.
> The kMeans++ [1] and kMeans|| [2] algorithms are better alternatives because
> they improve the initial seeding phase to achieve near-optimal clustering. It might be worthwhile
> to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
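
For readers unfamiliar with the seeding improvement mentioned in the issue description, the following is a rough single-machine sketch of kMeans++-style seeding in plain Python. The function name and signature are made up for illustration, and this is neither the Flink implementation nor the parallel kMeans|| variant. The first seed is drawn uniformly at random, and each further seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far.

```python
import math
import random


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centroids from `points` using kMeans++-style sampling."""
    rng = rng or random.Random()
    # First seed: uniform at random.
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from each point to its nearest already-chosen seed.
        weights = [min(math.dist(p, s) ** 2 for s in seeds) for p in points]
        if not any(weights):
            # Every point coincides with a seed; fall back to a uniform pick.
            seeds.append(rng.choice(points))
            continue
        # Next seed: sampled proportionally to the squared distance.
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


# Usage: two well-separated groups yield one seed from each group.
points = [(0.0, 0.0), (0.0, 0.0), (10.0, 10.0), (10.0, 10.0)]
seeds = kmeans_pp_seeds(points, k=2, rng=random.Random(42))
```

Seeds chosen this way could then serve as the initial centroids for the fitting step discussed in the comment above.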



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
