flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-8046) ContinuousFileMonitoringFunction wrongly ignores files with exact same timestamp
Date Fri, 10 Nov 2017 17:42:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247825#comment-16247825 ]

ASF GitHub Bot commented on FLINK-8046:
---------------------------------------

GitHub user juanmirocks opened a pull request:

    https://github.com/apache/flink/pull/4997

    [FLINK-8046] [flink-streaming-java] Have filter of timestamp compare with strictly SMALLER
(NOT smaller or equal)

    ## What is the purpose of the change
    
    This change fixes files with the exact same timestamp being wrongly ignored. It also makes
the code match the doc header of the method (`shouldIgnore`): "...if the modification time of the
file is smaller than...".
    
    Without this change, some files with the exact same timestamp (because their modification
times are the same long value) are ignored, which the user does not expect.
    
    The log also shows confusing entries such as `Ignoring file:/XXX, with mod time= 1510321363000
and global mod time= 1510321363000`
    
    ## Brief change log
    
    * Comparison is done with strictly SMALLER (<)
    
    ## Verifying this change
    
    This change is a trivial rework / code cleanup without any test coverage.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): no
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
      - The serializers: no
      - The runtime per-record code paths (performance sensitive): no
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing,
Yarn/Mesos, ZooKeeper: no
      - The S3 file system connector: no
    
    ## Documentation
    
      - Does this pull request introduce a new feature? no
      - If yes, how is the feature documented? JavaDocs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tagtog/flink master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4997.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4997
    
----
commit 2db52989fef2455413d42286893c5227983ee74b
Author: Juan Miguel Cejuela <i@juanmi.rocks>
Date:   2017-11-10T16:57:09Z

    compare as strictly SMALLER (not SMALLER OR EQUAL) (as per the doc header "if the modification
time of the file is smaller than")
    
    Otherwise, some files with same exact timestamp (because they were written at the same
exact long time) will be ignored.

----


> ContinuousFileMonitoringFunction wrongly ignores files with exact same timestamp
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-8046
>                 URL: https://issues.apache.org/jira/browse/FLINK-8046
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.2
>            Reporter: Juan Miguel Cejuela
>              Labels: stream
>             Fix For: 1.5.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The current monitoring of files sets the internal variable `globalModificationTime` to
> filter out files that are "older". However, the current check for "older" is
> `boolean shouldIgnore = modificationTime <= globalModificationTime;` (from `shouldIgnore`).
> The comparison should be strictly SMALLER (NOT smaller or equal). The method documentation
> also states "This happens if the modification time of the file is _smaller_ than...".
> Accepting equality as "older" causes some files with the exact same timestamp to be
> ignored. The behavior is also non-deterministic: the first file to be accepted ("first"
> being pretty much random) causes the remaining files with the same timestamp to be ignored.
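
The effect described in the issue can be reproduced with a small, self-contained sketch. This is not Flink's code: the class `TimestampFilterSketch`, the `FileInfo` holder, the `accept` helper and its `inclusive` flag are all hypothetical names introduced for illustration, and the sketch assumes the threshold is raised as each file is accepted within a single pass. Only the two comparison expressions mirror the ones quoted above.

```java
import java.util.ArrayList;
import java.util.List;

public class TimestampFilterSketch {

    /** Hypothetical stand-in for a monitored file: a name plus a modification time. */
    static final class FileInfo {
        final String name;
        final long modificationTime;

        FileInfo(String name, long modificationTime) {
            this.name = name;
            this.modificationTime = modificationTime;
        }
    }

    /**
     * Returns the names of the files that are not ignored, in listing order.
     * inclusive = true models the reported check (<=); false models the proposed check (<).
     * Assumption: the threshold is updated after every accepted file.
     */
    static List<String> accept(List<FileInfo> files, long initialGlobalModTime, boolean inclusive) {
        long globalModificationTime = initialGlobalModTime;
        List<String> accepted = new ArrayList<>();
        for (FileInfo f : files) {
            boolean shouldIgnore = inclusive
                    ? f.modificationTime <= globalModificationTime
                    : f.modificationTime < globalModificationTime;
            if (!shouldIgnore) {
                accepted.add(f.name);
                globalModificationTime = Math.max(globalModificationTime, f.modificationTime);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<FileInfo> files = new ArrayList<>();
        files.add(new FileInfo("a", 1510321363000L));
        files.add(new FileInfo("b", 1510321363000L)); // same timestamp as "a"

        System.out.println(accept(files, Long.MIN_VALUE, true));  // prints [a]
        System.out.println(accept(files, Long.MIN_VALUE, false)); // prints [a, b]
    }
}
```

With the inclusive check, whichever file happens to be listed first is accepted and the other is dropped, which is the non-determinism the report mentions. The sketch models one pass only; how files already read are treated on later scans is outside its scope.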



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
