flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1081) Add HDFS file-stream source for streaming
Date Sat, 06 Dec 2014 04:36:12 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236585#comment-14236585 ]

ASF GitHub Bot commented on FLINK-1081:

Github user chiwanpark commented on the pull request:

    I suggest a new implementation of this feature and would welcome feedback on the idea.
The feature would consist of two functions:
    1. `FileMonitoringFunction` emits a tuple with three fields: (modified file path, start
offset, end offset). This function implements `NonParallelInput`.
    2. `FileMapFunction` (I think this function should be renamed) reads the file at the
given path and emits its contents in the given range. This function implements `FlatMapFunction`
because there is no way to link two source functions together.
    When a user calls `readFileStream` on `StreamExecutionEnvironment`, the system creates
a `FileMonitoringFunction` and a `FileMapFunction`, links them, and returns them.
    This implementation fixes the parallelism problem with the monitoring instance.
The user can set the degree of parallelism of the source. In effect, the user sets the degree of
parallelism of the map function, while only one instance monitors the file system.
    Additionally, we can reuse `FileMapFunction` as a substitute for `FileSourceFunction`.
    What do you think of this implementation?
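
    To make the proposed split concrete, here is a minimal sketch in plain Java, without any Flink
dependency. It is not the code from the pull request: the names `FileRangeSketch`, `FileRange`,
`Monitor`, `poll`, and `readRange` are hypothetical, and the sketch assumes that files only grow by
appending. `Monitor` plays the role of the single monitoring instance that emits (path, start offset,
end offset), and `readRange` plays the role of the map step that could run in parallel.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FileRangeSketch {

    /** Hypothetical stand-in for the (path, start offset, end offset) tuple. */
    public static final class FileRange {
        public final String path;
        public final long start;
        public final long end;

        public FileRange(String path, long start, long end) {
            this.path = path;
            this.start = start;
            this.end = end;
        }
    }

    /** Single monitoring instance: remembers how many bytes of each file were already reported. */
    public static final class Monitor {
        private final Map<String, Long> reportedOffsets = new HashMap<>();

        /** Scans the directory once and returns one range per file that has grown. */
        public List<FileRange> poll(Path directory) throws IOException {
            List<FileRange> ranges = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(directory)) {
                for (Path file : files) {
                    if (!Files.isRegularFile(file)) {
                        continue;
                    }
                    String key = file.toString();
                    long previous = reportedOffsets.getOrDefault(key, 0L);
                    long size = Files.size(file);
                    if (size > previous) {
                        ranges.add(new FileRange(key, previous, size));
                        reportedOffsets.put(key, size);
                    }
                }
            }
            return ranges;
        }
    }

    /** Map step: reads only the bytes in [start, end) of the file named by the range. */
    public static String readRange(FileRange range) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(range.path, "r")) {
            byte[] buffer = new byte[(int) (range.end - range.start)];
            file.seek(range.start);
            file.readFully(buffer);
            return new String(buffer, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path directory = Files.createTempDirectory("monitor-sketch");
        Path file = directory.resolve("a.txt");
        Monitor monitor = new Monitor();

        // First poll reports the whole new file.
        Files.write(file, "hello\n".getBytes(StandardCharsets.UTF_8));
        for (FileRange range : monitor.poll(directory)) {
            System.out.print(readRange(range));
        }

        // Second poll reports only the appended bytes.
        Files.write(file, "world\n".getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
        for (FileRange range : monitor.poll(directory)) {
            System.out.print(readRange(range));
        }
    }
}
```

    Because each emitted range is self-describing, any number of readers can process ranges
independently, while the offset bookkeeping stays in the one monitor.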

> Add HDFS file-stream source for streaming
> -----------------------------------------
>                 Key: FLINK-1081
>                 URL: https://issues.apache.org/jira/browse/FLINK-1081
>             Project: Flink
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 0.7.0-incubating
>            Reporter: Gyula Fora
>            Assignee: Chiwan Park
>              Labels: starter
> Add a data stream source that will monitor a selected directory on HDFS (or other filesystems
> as well) and will process all new files created.

This message was sent by Atlassian JIRA
