flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1901) Create sample operator for Dataset
Date Fri, 31 Jul 2015 16:25:04 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649429#comment-14649429 ]

ASF GitHub Bot commented on FLINK-1901:

Github user tillrohrmann commented on the pull request:

    Thanks for your contribution, @ChengXiangLi. The code is really well tested and well structured.
Great work :-)
    I had only some minor comments. There is, however, one thing I'm not so sure about. With
the current implementation, all parallel tasks of the sampling operator get the same
random generator/seed value, so every parallel task generates the same sequence of random numbers.
I think this can have a negative influence on the sampling. What we could do is use `RichMapPartitionFunction`
instead of `MapPartitionFunction`. With the rich function, we have two options: we can use the
subtask index, given by `getRuntimeContext().getIndexOfThisSubtask()`, to modify the initial seed,
or we can create the random number generator in the `open` method
(this method is executed on the TaskManager). For the second option, assuming that the clocks are not
perfectly synchronized and that the individual tasks are not instantiated at exactly the same time, this
could give us less correlated random number sequences. What do you think?
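
    To make the subtask-index option concrete, here is a minimal sketch in plain Java, not the code from the pull request. `SubtaskSeeds`, `seedForSubtask` and `randomForSubtask` are hypothetical names, and the SplitMix64-style mixing step is just one possible way to combine the two values. The assumption is that `open` of the `RichMapPartitionFunction` would call it with the user-supplied seed and `getRuntimeContext().getIndexOfThisSubtask()`.

```java
import java.util.Random;

/** Hypothetical helper (not part of Flink): derives one seed per parallel subtask. */
final class SubtaskSeeds {
    private SubtaskSeeds() {}

    /**
     * Mixes the user-supplied base seed with the subtask index.
     * The mixing steps are the SplitMix64 finalizer, which is a bijection on
     * 64-bit values, so two different subtask indices never map to the same
     * seed for a given base seed.
     */
    static long seedForSubtask(long baseSeed, int subtaskIndex) {
        long z = baseSeed + (subtaskIndex + 1L) * 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Creates the generator a single subtask would use for sampling. */
    static Random randomForSubtask(long baseSeed, int subtaskIndex) {
        return new Random(seedForSubtask(baseSeed, subtaskIndex));
    }
}
```

    Because the derived seed depends only on the base seed and the subtask index, a run with the same seed and the same parallelism stays reproducible, which creating the generator from the clock in `open` would not give us.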

> Create sample operator for Dataset
> ----------------------------------
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
> In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility.

This message was sent by Atlassian JIRA
