beam-commits mailing list archives

From "Chamikara Jayalath (JIRA)" <j...@apache.org>
Subject [jira] [Created] (BEAM-522) Update FileSink.finalize_write() to be idempotent
Date Wed, 03 Aug 2016 21:04:20 GMT
Chamikara Jayalath created BEAM-522:
---------------------------------------

             Summary: Update FileSink.finalize_write() to be idempotent
                 Key: BEAM-522
                 URL: https://issues.apache.org/jira/browse/BEAM-522
             Project: Beam
          Issue Type: Bug
          Components: sdk-py
            Reporter: Chamikara Jayalath
            Assignee: Chamikara Jayalath


Currently, FileSink.finalize_write() in fileio.py [1] performs the following operations.

(1) Obtains a list of temporary files as a side input
(2) Renames each temporary file to the location where final output should be stored.

The iobase.Sink.finalize_write() operation should be idempotent, since runner implementations
may call it multiple times due to task failures.

The current implementation is not idempotent: if the operation is re-run after only a subset
of the files has been renamed, it may fail because the already-renamed files can no longer be
found at their source location (for example, [2] for GCS files).

We can fix this by checking whether the destination file already exists before performing
the rename, and skipping the rename for files that are already present at the destination.
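
A minimal sketch of the proposed check, for illustration only. It uses the local filesystem
(os.path.exists and os.rename) rather than the fileio/gcsio code paths, and the names
finalize_write_idempotent, rename_pairs and demo are hypothetical, not part of the SDK.

```python
import os
import tempfile


def finalize_write_idempotent(rename_pairs):
    """Rename each (temp_path, final_path) pair, skipping pairs already done.

    Hypothetical helper for illustration; not the actual FileSink code.
    Returns the list of final paths that were renamed by this call.
    """
    renamed = []
    for temp_path, final_path in rename_pairs:
        if os.path.exists(final_path):
            # A previous attempt already moved this file; nothing to do.
            continue
        os.rename(temp_path, final_path)
        renamed.append(final_path)
    return renamed


def demo():
    """Simulate a partially completed attempt followed by two retries."""
    workdir = tempfile.mkdtemp()
    pairs = []
    for i in range(3):
        temp_path = os.path.join(workdir, "temp-%d" % i)
        final_path = os.path.join(workdir, "output-%d" % i)
        with open(temp_path, "w") as f:
            f.write("shard %d\n" % i)
        pairs.append((temp_path, final_path))

    # The first attempt "fails" after renaming only the first file.
    os.rename(*pairs[0])

    # Retry: the first pair is skipped, the remaining two are renamed.
    first_retry = finalize_write_idempotent(pairs)
    # A second retry finds every destination present and does nothing.
    second_retry = finalize_write_idempotent(pairs)
    return pairs, first_retry, second_retry
```

Note that an exists-then-rename check is not atomic, and it assumes that a file found at the
destination was produced by an earlier attempt of the same write. Whether those assumptions
hold for a given runner or file system would need to be confirmed separately.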

[1] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L503

[2] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L187




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
