flink-issues mailing list archives

From 陈梓立 (JIRA) <j...@apache.org>
Subject [jira] [Commented] (FLINK-10038) Parallel the creation of InputSplit if necessary
Date Thu, 30 Aug 2018 09:07:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597215#comment-16597215
] 

陈梓立 commented on FLINK-10038:
-----------------------------

By "parallelize the creation of InputSplit" I originally meant parallelizing the creation
of ONE InputSplit. Take a look at {{FileInputFormat#createInputSplits}}: it creates
InputSplits file by file, and that is where I aim to parallelize. That is why it was said that "the interface
for the creation of input splits is definitely InputSplitSource#createInputSplits". This
could be done without modifying the interface, by changing the implementation of {{createInputSplits}}.
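
To make the idea concrete, here is a minimal sketch of creating splits for several files in parallel
behind an unchanged method signature. It does not use Flink classes: {{SplitSketch}},
{{ParallelSplitCreation}}, {{splitsForFile}} and the thread-pool size are hypothetical names and
assumptions for illustration, and the real {{FileInputFormat}} logic (block locations, unsplittable
files, directory enumeration) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a file split; this is NOT Flink's FileInputSplit.
final class SplitSketch {
    final String file;
    final long start;
    final long length;

    SplitSketch(String file, long start, long length) {
        this.file = file;
        this.start = start;
        this.length = length;
    }
}

final class ParallelSplitCreation {

    // The per-file work: cut one file into ranges of at most splitSize bytes.
    static List<SplitSketch> splitsForFile(String file, long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<SplitSketch> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            splits.add(new SplitSketch(file, start, Math.min(splitSize, fileSize - start)));
        }
        return splits;
    }

    // Same inputs and outputs as a sequential version; only the body runs the files in parallel.
    static List<SplitSketch> createInputSplits(
            List<String> files, List<Long> sizes, long splitSize, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<SplitSketch>>> tasks = new ArrayList<>();
            for (int i = 0; i < files.size(); i++) {
                final String file = files.get(i);
                final long size = sizes.get(i);
                tasks.add(() -> splitsForFile(file, size, splitSize));
            }
            List<SplitSketch> all = new ArrayList<>();
            // invokeAll returns the futures in task order, so the result order is deterministic.
            for (Future<List<SplitSketch>> future : pool.invokeAll(tasks)) {
                all.addAll(future.get());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<SplitSketch> splits = createInputSplits(
                Arrays.asList("a", "b"), Arrays.asList(250L, 100L), 100L, 4);
        System.out.println(splits.size() + " splits created");
    }
}
```

Callers see the same list they would get from a sequential loop, which is why no interface change
would be needed under these assumptions.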

However, your ideas here are also bright. A typical case that benefits from these
ideas is a batch job with many files, where we would prefer to use the RegionFailover strategy if
possible.
Here I see 3 options. 1. Create InputSplits before the job runs. 2. Create InputSplits concurrently
with scheduling the job. 3. Use a specific single task to generate the work.
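
The difference between options 1 and 2 can be sketched with plain {{CompletableFuture}} code. This
is only an illustration: {{SplitCreationTiming}}, {{createSplits}} and {{scheduleJob}} are
hypothetical stand-ins, not Flink APIs, and real scheduling is far more involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class SplitCreationTiming {

    // Hypothetical stand-in for InputSplit creation.
    static List<String> createSplits() {
        return Arrays.asList("split-0", "split-1");
    }

    // Hypothetical stand-in for scheduling the job.
    static String scheduleJob() {
        return "scheduled";
    }

    // Option 1: all splits exist before scheduling starts.
    static String splitsBeforeJob() {
        List<String> splits = createSplits();
        String state = scheduleJob();
        return state + " with " + splits.size() + " splits";
    }

    // Option 2: split creation runs on another thread while the job is being scheduled.
    static String splitsConcurrentWithScheduling() {
        CompletableFuture<List<String>> splits =
                CompletableFuture.supplyAsync(SplitCreationTiming::createSplits);
        String state = scheduleJob();
        // join() waits only if split creation has not finished yet.
        return state + " with " + splits.join().size() + " splits";
    }

    public static void main(String[] args) {
        System.out.println(splitsBeforeJob());
        System.out.println(splitsConcurrentWithScheduling());
    }
}
```

Both variants produce the same result; option 2 only overlaps the two steps in time, which is where
the failure handling becomes harder.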

Option 1 is easier to implement, as [~StephanEwen] said. Below are the concrete challenges for
the remaining options.

My main concern is that in a batch job we prefer not to cancel all vertices and restart.
What's worse, since we don't have batch checkpoints, the batch job has to restart completely.
This is unacceptable for a large-scale batch job.
For option 2, what if the JM fails over after some input splits have been computed and sent off?
We don't have a specific JM failover strategy now, so this causes the job to restart completely.
Pursuing this option leads to a discussion of a JM failover strategy, that is, when the JM fails over
and restarts, it can recover (reconcile) state from the previous one.
For option 3, there would be wider considerations about Source. Consider the two-input case (below).
Currently we read from the source in a blocking manner. If we now compute the input splits in a single
task and still use the blocking approach, the downstream may get stuck waiting for one input while the
other input is ready to be read.

Src1 ----\
Src2---->Join

One way to solve this issue is to read from the source in a non-blocking manner. Assume we introduce a method
{{boolean SourceFunction#next(Collector<T>)}}: when the downstream calls it, the source
sends its data to the collector and returns true. If no more data remains, it returns false.
This also allows the source to read from files and produce data asynchronously.
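
A rough sketch of how a two-input consumer could use such a method follows. Nothing here is an
existing Flink API: {{PullSource}}, {{CollectorSketch}}, {{QueueSource}} and {{TwoInputPolling}} are
hypothetical names, the sources are backed by in-memory queues, and each call to {{next}} is assumed
to emit at most one record.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical minimal collector; not Flink's Collector interface.
interface CollectorSketch<T> {
    void collect(T record);
}

// Hypothetical pull-based source following the proposed next(Collector) contract.
interface PullSource<T> {
    // Emits data to the collector and returns true, or returns false when no data remains.
    boolean next(CollectorSketch<T> out);
}

// In-memory source used only to exercise the contract.
final class QueueSource<T> implements PullSource<T> {
    private final Queue<T> pending;

    QueueSource(List<T> records) {
        this.pending = new ArrayDeque<>(records);
    }

    @Override
    public boolean next(CollectorSketch<T> out) {
        T record = pending.poll();
        if (record == null) {
            return false; // exhausted
        }
        out.collect(record);
        return true;
    }
}

final class TwoInputPolling {

    // Alternates between both inputs so that neither one can starve the other.
    static <T> List<T> drain(PullSource<T> left, PullSource<T> right) {
        List<T> seen = new ArrayList<>();
        CollectorSketch<T> out = seen::add;
        boolean leftOpen = true;
        boolean rightOpen = true;
        while (leftOpen || rightOpen) {
            if (leftOpen) {
                leftOpen = left.next(out);
            }
            if (rightOpen) {
                rightOpen = right.next(out);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> records = drain(
                new QueueSource<>(Arrays.asList("l1", "l2", "l3")),
                new QueueSource<>(Arrays.asList("r1")));
        System.out.println(records); // prints [l1, r1, l2, l3]
    }
}
```

A boolean return value cannot express "nothing ready yet, but not finished", so a real design would
probably need a third state or a future-based signal; the sketch sidesteps this by using sources
that always have their data at hand.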

To sum up, focusing on batch jobs, the main issues would be JM failover for options
1 and 2 (plus batch checkpointing, which is outside this issue but significant), and a more flexible
source for option 3.

> Parallel the creation of InputSplit if necessary
> ------------------------------------------------
>
>                 Key: FLINK-10038
>                 URL: https://issues.apache.org/jira/browse/FLINK-10038
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: 陈梓立
>            Priority: Major
>              Labels: improvement, inputformat, parallel, perfomance
>
> As a continuation of the discussion in the PR about parallelizing the creation of ExecutionJobVertex
[here|https://github.com/apache/flink/pull/6353].
> [~StephanEwen] suggested that we could parallelize the creation of InputSplit, from which
we would gain performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
