beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-1542) Need Source/Sink for Spanner
Date Fri, 18 Aug 2017 01:13:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131563#comment-16131563 ]

ASF GitHub Bot commented on BEAM-1542:
--------------------------------------

GitHub user mairbek opened a pull request:

    https://github.com/apache/beam/pull/3729

    [BEAM-1542] Added a preprocessing step to the Cloud Spanner sink.

    The general intuition: if mutations are presorted by primary key before batching, the
mutations in a batch are more likely to end up in the same partition. This reduces the number
of participants in the distributed transaction on the Cloud Spanner side and leads to better
throughput.
    
    Mutations are encoded before the other steps run, to avoid paying the serialization price.
Primary keys are encoded using the OrderedCode library, and the ApproximateQuantiles transform
is used to sample keys.
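
    As a rough illustration of the encoding and sampling step, the sketch below uses a simplified order-preserving byte encoding and an exact, sort-based quantile pick. It is not the OrderedCode wire format and not Beam's ApproximateQuantiles transform; the function names (`encode_int`, `encode_str`, `encode_key`, `sample_boundaries`) are hypothetical and exist only for this example.

```python
import struct


def encode_int(value: int) -> bytes:
    """Encode a signed 64-bit integer so that byte order matches numeric order.

    Assumption for this sketch: shifting by 2**63 and writing big-endian
    unsigned bytes is enough to make bytewise comparison agree with
    integer comparison.
    """
    return struct.pack(">Q", value + (1 << 63))


def encode_str(value: str) -> bytes:
    """Encode a string so that byte order matches UTF-8 byte order.

    Embedded zero bytes are escaped and a terminator is appended so that
    one encoded string is never a prefix of another.
    """
    raw = value.encode("utf-8")
    return raw.replace(b"\x00", b"\x00\xff") + b"\x00\x01"


def encode_key(parts) -> bytes:
    """Concatenate the encoded parts of a composite primary key."""
    out = bytearray()
    for part in parts:
        out += encode_int(part) if isinstance(part, int) else encode_str(part)
    return bytes(out)


def sample_boundaries(encoded_keys, num_buckets: int):
    """Pick num_buckets - 1 evenly spaced keys from the sorted input.

    This is an exact stand-in for an approximate quantile computation.
    """
    ordered = sorted(encoded_keys)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_buckets] for i in range(1, num_buckets)]
```

    Because the encoded keys compare bytewise in the same order as the original keys, the sampled boundaries can be compared directly against encoded mutation keys without decoding them.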
    
    Once primary keys are sampled, each mutation is keyed by the index of the closest sampled
primary key, and mutations are grouped by that key. Range deletes are submitted separately.
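
    The assignment and grouping step can be sketched as follows. This is a minimal illustration, not the code in the pull request: mutations are modeled as plain dictionaries with hypothetical `op` and `key` fields, and "closest" is interpreted here as the index of the first sampled boundary that is greater than or equal to the mutation's encoded key.

```python
from bisect import bisect_left
from collections import defaultdict


def assign_bucket(boundaries, key: bytes) -> int:
    """Return the index of the first boundary >= key.

    Keys larger than every boundary fall into the last bucket,
    whose index equals len(boundaries).
    """
    return bisect_left(boundaries, key)


def group_mutations(mutations, boundaries):
    """Group point mutations by bucket index and set range deletes aside.

    `boundaries` must be sorted encoded keys. Each mutation is a dict with
    a hypothetical "op" field and, for point mutations, an encoded "key".
    """
    groups = defaultdict(list)
    range_deletes = []
    for mutation in mutations:
        if mutation.get("op") == "delete_range":
            # Range deletes have no single primary key, so they are kept apart.
            range_deletes.append(mutation)
        else:
            groups[assign_bucket(boundaries, mutation["key"])].append(mutation)
    return dict(groups), range_deletes
```

    Mutations that share a bucket index have nearby primary keys, so a batch built from one group is more likely to touch a single partition.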

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mairbek/beam prepro-pr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3729
    
----
commit 7aeb0b0d02c308c690fb598b69a7aec649e4bb89
Author: Mairbek Khadikov <mairbek@google.com>
Date:   2017-07-20T23:22:04Z

    Added a preprocessing step to the Cloud Spanner sink.
    
    The general intuition: if mutations are presorted by primary key before batching, the
mutations in a batch are more likely to end up in the same partition. This reduces the number
of participants in the distributed transaction on the Cloud Spanner side and leads to better
throughput.
    
    Mutations are encoded before the other steps run, to avoid paying the serialization price.
Primary keys are encoded using the OrderedCode library, and the ApproximateQuantiles transform
is used to sample keys.
    
    Once primary keys are sampled, each mutation is keyed by the index of the closest sampled
primary key, and mutations are grouped by that key. Range deletes are submitted separately.

----


> Need Source/Sink for Spanner
> ----------------------------
>
>                 Key: BEAM-1542
>                 URL: https://issues.apache.org/jira/browse/BEAM-1542
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Guy Molinari
>            Assignee: Mairbek Khadikov
>
> Is there a source/sink for Spanner in the works? If not, I would gladly give this a shot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
