flink-issues mailing list archives

From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10038) Parallel the creation of InputSplit if necessary
Date Mon, 06 Aug 2018 09:30:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569938#comment-16569938 ]

Stephan Ewen commented on FLINK-10038:
--------------------------------------

I think this is a nice idea and can help, especially for batch use cases with many files.

Here are some design questions that we should probably discuss first:

  - When do we require split generation to be complete? Before starting the job, or can split
computation continue while the job is running?
    (1) The first version is much easier to implement.
    (2) The second version would start the job faster, but both responding to "getNextSplit()"
and recovery are harder, because split computation may not have finished when failover or split
requests happen.
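
To make the two options concrete, here is a rough sketch in plain Java. It is not Flink code:
SplitGenerationSketch, computeSplits, eagerSplits and LazySplitAssigner are hypothetical names,
and splits are modeled as plain strings. It only illustrates why (2) is harder: getNextSplit()
has to tell "no split ready yet" apart from "no splits left".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SplitGenerationSketch {

    // Hypothetical stand-in for enumerating the splits of one file.
    public static List<String> computeSplits(String file) {
        List<String> splits = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            splits.add(file + "#" + i);
        }
        return splits;
    }

    // Option (1): compute splits in parallel, but wait for all of them
    // before the job starts. The result is a complete, fixed list.
    public static List<String> eagerSplits(List<String> files, ExecutorService pool) {
        List<CompletableFuture<List<String>>> futures = new ArrayList<>();
        for (String file : files) {
            futures.add(CompletableFuture.supplyAsync(() -> computeSplits(file), pool));
        }
        List<String> all = new ArrayList<>();
        for (CompletableFuture<List<String>> future : futures) {
            all.addAll(future.join()); // blocks until this file's splits exist
        }
        return all;
    }

    // Option (2): the job starts immediately and splits arrive while it runs.
    public static class LazySplitAssigner {
        private final BlockingQueue<String> ready = new LinkedBlockingQueue<>();
        private final AtomicInteger pendingFiles;

        public LazySplitAssigner(List<String> files, ExecutorService pool) {
            pendingFiles = new AtomicInteger(files.size());
            for (String file : files) {
                pool.execute(() -> {
                    try {
                        ready.addAll(computeSplits(file));
                    } finally {
                        // Decrement after publishing, so "0 pending" implies
                        // every computed split is already in the queue.
                        pendingFiles.decrementAndGet();
                    }
                });
            }
        }

        // Returns the next split, or null once all splits have been handed out
        // (or if the calling thread is interrupted while waiting).
        public String getNextSplit() {
            try {
                while (true) {
                    String split = ready.poll(10, TimeUnit.MILLISECONDS);
                    if (split != null) {
                        return split;
                    }
                    if (pendingFiles.get() == 0) {
                        // Generation finished; take anything added since the poll.
                        return ready.poll();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> files = List.of("part-0", "part-1", "part-2");
        System.out.println("eager: " + eagerSplits(files, pool));
        LazySplitAssigner lazy = new LazySplitAssigner(files, pool);
        for (String s = lazy.getNextSplit(); s != null; s = lazy.getNextSplit()) {
            System.out.println("lazy: " + s);
        }
        pool.shutdown();
    }
}
```

The sketch leaves out recovery entirely. With (1) the full split list exists before anything
runs, so it can simply be reassigned after a failure; with (2) a failover could happen while
some files are still being enumerated, which is the harder part mentioned above.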

Once we know which approach to follow, I think we can go into details.

> Parallel the creation of InputSplit if necessary
> ------------------------------------------------
>
>                 Key: FLINK-10038
>                 URL: https://issues.apache.org/jira/browse/FLINK-10038
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.5.0
>            Reporter: 陈梓立
>            Priority: Major
>              Labels: improvement, inputformat, parallel, perfomance
>
> As a continuation of the discussion in the PR about parallelizing the creation of ExecutionJobVertex
> [here|https://github.com/apache/flink/pull/6353].
> [~StephanEwen] suggested that we could parallelize the creation of InputSplit, from which
> we would gain performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
