beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Amit Sela (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-848) Make post-read (unbounded) shuffle use coders instead of Kryo.
Date Fri, 04 Nov 2016 22:20:58 GMT

    [ https://issues.apache.org/jira/browse/BEAM-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637890#comment-15637890
] 

Amit Sela commented on BEAM-848:
--------------------------------

I have this patched-up but I'm waiting to find the time to actually test and see this break
before the patch, and run successfully after.
Breaking this should be easy enough if the input from the {{UnboundedSource}} is not Kryo-serializable.

> Make post-read (unbounded) shuffle use coders instead of Kryo.
> --------------------------------------------------------------
>
>                 Key: BEAM-848
>                 URL: https://issues.apache.org/jira/browse/BEAM-848
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>
> The SparkRunner uses {{mapWithState}} to read and manage CheckpointMarks, and this stateful
operation will be followed by a shuffle: 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/MapWithStateDStream.scala#L159
> It would be best to use coders here: https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SparkUnboundedSource.java#L71
in order to get an Iterator<byte[]> and decode afterwards.
> This method is preferred for two reasons:
> 1. Known coding should be faster then Kryo.
> 2. Users are already required to explicitly use coders when authoring a Pipeline and
the runner should avoid adding complexity of "registering" additional serializers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message