beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (BEAM-5404) Inefficient Serialization of Spanner MutationGroup in pipeline
Date Fri, 21 Sep 2018 15:25:01 GMT

     [ https://issues.apache.org/jira/browse/BEAM-5404?focusedWorklogId=146412&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-146412 ]

ASF GitHub Bot logged work on BEAM-5404:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 21/Sep/18 15:24
            Start Date: 21/Sep/18 15:24
    Worklog Time Spent: 10m 
      Work Description: nielm commented on issue #6407: [BEAM-5404] Use Java serialization for MutationGroup objects.
URL: https://github.com/apache/beam/pull/6407#issuecomment-423572158
 
 
   This PR is irrelevant and will be withdrawn: a bug in my testing made Java serialization look more efficient than it is.
   
   My mutations had 10 columns of 10K-character strings, but every column held _the same_ String object:

   String stringValue = new String(/* 10K char array */);
   
   Mutation m = Mutation.newInsertOrUpdateBuilder("table1")
       .set("key").to(UUID.randomUUID().toString())
       .set("value0").to(stringValue)
       .set("value1").to(stringValue)
       .set("value2").to(stringValue)
   // etc
   
   So when the custom serializer encoded this, it produced a ~100K byte array, while Java serialization was being clever: it saw only one String object to serialize, wrote it once, and produced a ~10K byte array.
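
   The effect described above can be reproduced outside Beam and Spanner. The following is a minimal sketch, not the connector's code: the class SharedStringDemo and its methods are hypothetical names, and naiveEncodedSize is only a stand-in for any encoder that writes every value in full. It relies on ObjectOutputStream tracking objects by identity, so a String object that appears ten times is written once and then referenced, while ten distinct String objects with equal contents are each written in full.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Arrays;

public class SharedStringDemo {

    // Builds ten 10K-character strings. With share=true every slot holds the
    // same String object (the situation in the flawed test); with share=false
    // each slot holds a distinct object with identical contents.
    static String[] tenStrings(boolean share) {
        char[] chars = new char[10_000];
        Arrays.fill(chars, 'x');
        String shared = new String(chars);
        String[] values = new String[10];
        for (int i = 0; i < values.length; i++) {
            values[i] = share ? shared : new String(chars);
        }
        return values;
    }

    // Size of the array under Java serialization. Repeated references to the
    // same object are written as short back-references, not as full copies.
    static int javaSerializedSize(String[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(values);
        }
        return bytes.size();
    }

    // Hypothetical stand-in for a value-by-value encoder (an assumption, not
    // the Spanner connector's encoder): every value is written in full.
    static int naiveEncodedSize(String[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (String value : values) {
                out.writeUTF(value); // 2-byte length prefix plus the characters
            }
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        String[] same = tenStrings(true);
        String[] distinct = tenStrings(false);
        System.out.println("naive, same object:       " + naiveEncodedSize(same));
        System.out.println("java,  same object:       " + javaSerializedSize(same));
        System.out.println("java,  distinct objects:  " + javaSerializedSize(distinct));
    }
}
```

   With the same object in every slot, Java serialization should come out roughly 10x smaller than the naive encoding; with distinct objects the two should be about the same size, which is the case that matters for real mutations.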
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 146412)
    Time Spent: 40m  (was: 0.5h)

> Inefficient Serialization of Spanner MutationGroup in pipeline
> --------------------------------------------------------------
>
>                 Key: BEAM-5404
>                 URL: https://issues.apache.org/jira/browse/BEAM-5404
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>    Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0
>            Reporter: Niel Markwick
>            Assignee: Chamikara Jayalath
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> The Cloud Spanner connector uses a custom serialization mechanism to convert MutationGroup objects into a byte array.
> This mechanism is very inefficient, producing byte arrays approx. 10x larger than simple Java serialization of the MutationGroup objects, which increases the resources needed by the connector to ~40x the size of the original mutations.
> There are no obvious benefits to using this custom serialization system, as the objects are deserialized within the pipeline itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
