beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2154) Writing to large numbers of BigQuery tables causes out-of-memory
Date Thu, 04 May 2017 00:33:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995960#comment-15995960 ]

ASF GitHub Bot commented on BEAM-2154:
--------------------------------------

GitHub user reuvenlax opened a pull request:

    https://github.com/apache/beam/pull/2883

    [BEAM-2154] Make BigQuery's dynamic-destination support scale to large numbers of destinations

     Generating hundreds or thousands of file write buffers in a single bundle was causing
workers to crash with out-of-memory errors. We now detect when too many files have been written
in a bundle and spill the remaining records to another PCollection. This PCollection is then
grouped by destination before we write the remaining data to files. We shard destination keys
10 ways to prevent hotspotting. Tests with up to 10TB of data (going from 20 output tables up
to 4000) were run, and a sharding factor of 10 seems to work quite well on all runs (and is
noticeably faster than not sharding).
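
The spill-and-shard idea described above can be sketched in plain Java, without any Beam
dependencies. This is only an illustration under stated assumptions, not the code in the pull
request: the class SpillSketch, the method processBundle, the maxWriters parameter, and the
"destination#shard" key format are all hypothetical names. In-memory lists stand in for file
write buffers, and shards are assigned round-robin so that the result is deterministic (a
random shard would serve the same purpose).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of per-bundle spilling; not the Beam implementation. */
final class SpillSketch {
    // Sharding factor taken from the description above.
    static final int NUM_SHARDS = 10;

    /** What one bundle produced: directly written records and spilled records. */
    static final class BundleResult {
        // destination -> records buffered directly (stand-in for an open file writer)
        final Map<String, List<String>> written = new HashMap<>();
        // "destination#shard" -> records spilled, to be grouped and written later
        final Map<String, List<String>> spilled = new TreeMap<>();
    }

    /**
     * Processes one bundle of {destination, row} pairs. At most maxWriters
     * destinations get their own buffer; records for any further destination
     * are spilled under a sharded key instead of opening another buffer.
     */
    static BundleResult processBundle(List<String[]> records, int maxWriters) {
        BundleResult result = new BundleResult();
        int spillCount = 0;
        for (String[] record : records) {
            String destination = record[0];
            String row = record[1];
            List<String> writer = result.written.get(destination);
            if (writer == null && result.written.size() < maxWriters) {
                // Still under the limit: open a new buffer for this destination.
                writer = new ArrayList<>();
                result.written.put(destination, writer);
            }
            if (writer != null) {
                writer.add(row);
            } else {
                // Over the limit: spill under a sharded key (assumption: round-robin)
                // so that one large destination is spread over NUM_SHARDS groups.
                int shard = spillCount++ % NUM_SHARDS;
                String shardedKey = destination + "#" + shard;
                result.spilled.computeIfAbsent(shardedKey, k -> new ArrayList<>()).add(row);
            }
        }
        return result;
    }
}
```

In this sketch the spilled map plays the role of the second PCollection after grouping: each
sharded key would be written by its own writer, so the number of simultaneously open buffers
in a bundle stays bounded by maxWriters.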


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/reuvenlax/incubator-beam bigquery_scalability

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2883.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2883
    
----
commit 45eb1f8ec1f84a3eefd6d85539e9dc433be4842f
Author: Reuven Lax <relax@google.com>
Date:   2017-04-29T14:33:54Z

    If too many tables are generated in a bundle, spill and group the results before writing
files. Generating hundreds or thousands of file write buffers in a single bundle was causing
workers to crash with out of memory.

----


> Writing to large numbers of BigQuery tables causes out-of-memory 
> -----------------------------------------------------------------
>
>                 Key: BEAM-2154
>                 URL: https://issues.apache.org/jira/browse/BEAM-2154
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-gcp
>            Reporter: Reuven Lax
>            Assignee: Reuven Lax
>             Fix For: First stable release
>
>
> Since all TableRowWriters are created in a single DoFn, the write buffers all exist simultaneously
> and use up large amounts of memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
