beam-commits mailing list archives

From "Reuven Lax (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-5426) Use both destination and TableDestination for BQ load job IDs
Date Tue, 18 Sep 2018 22:08:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619792#comment-16619792 ]

Reuven Lax commented on BEAM-5426:
----------------------------------

If different destinations return the same TableDestination, worse things can happen: parallel
loads to the same table may be issued from different workers (since we distribute based on
the destination), which can corrupt data (e.g. if the write disposition is set to
WRITE_TRUNCATE).

> Use both destination and TableDestination for BQ load job IDs
> -------------------------------------------------------------
>
>                 Key: BEAM-5426
>                 URL: https://issues.apache.org/jira/browse/BEAM-5426
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Chamikara Jayalath
>            Priority: Major
>
> Currently we use TableDestination when creating a unique load job ID for a destination:
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryHelpers.java#L359]
>  
> This can result in a data loss issue if a user returns the same TableDestination for
> different destination IDs. I think we can prevent this if we include both IDs in the BQ load
> job ID.
>  
> CC: [~reuvenlax]
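
The proposal in the issue description can be illustrated with a minimal sketch. It is not the code in BigQueryHelpers.java: the class name, method names, ID layout, and the use of SHA-256 are all assumptions made for illustration. It shows only that an ID derived from the table alone collides when two destinations map to the same table, while an ID derived from both the destination and the table does not.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; none of these names come from the Beam codebase.
class LoadJobIdSketch {

    // Hex-encoded prefix of a SHA-256 digest, so the result contains only
    // characters that BigQuery accepts in job IDs (letters, digits, '_', '-').
    static String shortHash(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                hex.append(String.format("%02x", bytes[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Hypothetical "table only" scheme: two destinations that resolve to the
    // same table spec produce the same job ID.
    static String jobIdFromTableOnly(String prefix, String tableSpec, int partition) {
        return prefix + "_" + shortHash(tableSpec) + "_" + String.format("%05d", partition);
    }

    // Hypothetical scheme using both IDs. The two values are hashed separately
    // so that concatenation cannot make different pairs look identical.
    static String jobIdFromBoth(
            String prefix, String destinationKey, String tableSpec, int partition) {
        return prefix
                + "_" + shortHash(destinationKey)
                + "_" + shortHash(tableSpec)
                + "_" + String.format("%05d", partition);
    }

    public static void main(String[] args) {
        String table = "project:dataset.table"; // placeholder table spec
        String a = jobIdFromBoth("beam_load", "destA", table, 0);
        String b = jobIdFromBoth("beam_load", "destB", table, 0);
        System.out.println("table-only IDs equal: "
                + jobIdFromTableOnly("beam_load", table, 0)
                        .equals(jobIdFromTableOnly("beam_load", table, 0)));
        System.out.println("combined IDs equal:   " + a.equals(b));
    }
}
```

Note that distinct job IDs only avoid the ID collision. As the comment above points out, two destinations that return the same TableDestination can still trigger parallel loads into one table, so this sketch does not address corruption under WRITE_TRUNCATE.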



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
