beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner
Date Wed, 29 Nov 2017 01:25:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269876#comment-16269876
] 

Eugene Kirpichov commented on BEAM-3268:
----------------------------------------

Yes, this is a bug: the transforms that produce perDestinationOutputFilenames emit
them before the files are actually copied, e.g. https://github.com/apache/beam/blob/f8d8ff14c49e4dfb15541f4b73aa66513c9a9d23/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L942

One fix is to reorder that code (and the respective code in FinalizeWindowedFn). Another fix
is to insert a reshuffle somewhere around https://github.com/apache/beam/blob/f8d8ff14c49e4dfb15541f4b73aa66513c9a9d23/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L816
The reshuffle is less brittle, so I would prefer the latter.
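
The ordering hazard can be shown with a toy model in plain Java. This is only a sketch and not Beam code: the class and method names are hypothetical, a local temp directory stands in for the output location, and a simple existence check stands in for a downstream reader. It shows the effect of announcing a final filename before versus after the temp file is moved into place, i.e. the "reorder" fix; it does not model the reshuffle.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical toy model of the finalize step; none of these names come from Beam.
class FinalizeOrderDemo {

    // Buggy order: the final name is announced (and checked by a consumer)
    // before the temp file has been moved to its destination.
    static boolean announceBeforeMove(Path dir) throws IOException {
        Path temp = Files.writeString(dir.resolve("part-0.tmp"), "data");
        Path dest = dir.resolve("part-0.txt");
        boolean visibleWhenAnnounced = Files.exists(dest); // consumer looks now
        Files.move(temp, dest);                            // file arrives too late
        return visibleWhenAnnounced;
    }

    // Reordered: move the temp file first, then announce the final name.
    static boolean announceAfterMove(Path dir) throws IOException {
        Path temp = Files.writeString(dir.resolve("part-1.tmp"), "data");
        Path dest = dir.resolve("part-1.txt");
        Files.move(temp, dest);
        return Files.exists(dest);                         // consumer looks now
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("finalize-demo");
        System.out.println("announce before move: visible = " + announceBeforeMove(dir));
        System.out.println("announce after move:  visible = " + announceAfterMove(dir));
    }
}
```

In the first method the consumer finds no file at the announced name, which mirrors the runner trying to read files that do not exist yet.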

> getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner
> ---------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-3268
>                 URL: https://issues.apache.org/jira/browse/BEAM-3268
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>    Affects Versions: 2.3.0
>            Reporter: Kamil Szewczyk
>            Assignee: Reuven Lax
>         Attachments: comparison.png
>
>
> While running filebased-io-test we found the dataflow-runner misbehaving. We run the tests in a
> single pipeline, and without Reshuffling between writing and reading the Dataflow jobs
> fail because the runner tries to access files that have not been created yet.
> The picture shows the difference in how the write executes. On the left is the
> working example with Reshuffling added, and on the right the one without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name with your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests -DintegrationTestPipelineOptions='["--runner=dataflow", "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then check the cloud console; the job should fail.
> Now add Reshuffling to sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
> as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.<String>create())
>         .apply(Reshuffle.<String>viaRandomKey());
>     PCollection<String> consolidatedHashcode = testFilenames
> {code}
> Then rerun the previous Maven command to see it working in the console.


