beam-commits mailing list archives

From "Kyle Winkelman (JIRA)" <>
Subject [jira] [Commented] (BEAM-5519) Spark Streaming Duplicated Encoding/Decoding Effort
Date Thu, 27 Sep 2018 20:43:00 GMT


Kyle Winkelman commented on BEAM-5519:

// GroupCombineFunctions.groupByKeyOnly
JavaRDD<WindowedValue<KV<K, V>>>
JavaRDD<WindowedValue<KV<K, WindowedValue<V>>>>
JavaRDD<KV<K, WindowedValue<V>>>
JavaPairRDD<K, WindowedValue<V>>
JavaPairRDD<ByteArray, byte[]>
JavaPairRDD<ByteArray, Iterable<byte[]>> // groupByKey
JavaPairRDD<K, Iterable<WindowedValue<V>>>
JavaRDD<KV<K, Iterable<WindowedValue<V>>>>
JavaRDD<WindowedValue<KV<K, Iterable<WindowedValue<V>>>>>
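The chain above can be sketched with plain Java maps and lists standing in for the RDDs. The encode/decode functions below are hypothetical stand-ins for Beam's Coders; the point is that values are serialized to byte[] just so Spark can shuffle them, then immediately deserialized after groupByKey:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.stream.Collectors;

// Sketch of the first encode/decode cycle in GroupCombineFunctions.groupByKeyOnly.
// encode/decode are hypothetical stand-ins for Beam's Coders.
public class GroupByKeyOnlySketch {

    // Hypothetical value coder.
    static byte[] encode(String v) { return v.getBytes(StandardCharsets.UTF_8); }
    static String decode(byte[] b) { return new String(b, StandardCharsets.UTF_8); }

    // input: JavaRDD<KV<K, V>> stand-in; output: JavaRDD<KV<K, Iterable<V>>> stand-in.
    static Map<String, List<String>> run(List<Map.Entry<String, String>> input) {
        // 1st encode: JavaPairRDD<ByteArray, byte[]>, so Spark shuffles raw bytes
        Map<String, List<byte[]>> shuffled = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : input) {
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(encode(e.getValue()));
        }
        // groupByKey ran on the encoded form; 1st decode restores the values
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        shuffled.forEach((k, vs) -> grouped.put(
            k, vs.stream().map(GroupByKeyOnlySketch::decode).collect(Collectors.toList())));
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
            Map.entry("k1", "a"), Map.entry("k1", "b"), Map.entry("k2", "c"))));
        // prints {k1=[a, b], k2=[c]}
    }
}
```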

// SparkGroupAlsoByWindowViaWindowSet.buildPairDStream
JavaRDD<KV<K, Iterable<WindowedValue<V>>>>
JavaPairRDD<K, Iterable<WindowedValue<V>>>
JavaPairRDD<K, KV<Long, Iterable<WindowedValue<V>>>>
JavaPairRDD<ByteArray, byte[]>
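The last step of this chain is the second encode: the values that were just decoded after groupByKey are paired with the batch timestamp (KV<Long, Iterable<WindowedValue<V>>>) and serialized back to byte[] so they can sit in Spark's updateStateByKey state. A minimal sketch, using plain Java serialization as a hypothetical stand-in for Beam's coders:

```java
import java.io.*;
import java.util.*;

// Sketch of the second encode in SparkGroupAlsoByWindowViaWindowSet.buildPairDStream.
// Plain Java serialization is a hypothetical stand-in for Beam's coders.
public class BuildPairDStreamSketch {

    // KV<Long, Iterable<V>> stand-in -> byte[] (2nd encode)
    static byte[] encodeState(long batchTimestamp, List<String> groupedValues) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeLong(batchTimestamp);                   // the Long in KV<Long, ...>
            out.writeObject(new ArrayList<>(groupedValues)); // the grouped iterable
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // values that were just decoded after groupByKey get re-encoded here
        byte[] state = encodeState(1_000L, List.of("a", "b"));
        System.out.println(state.length > 0); // prints true
    }
}
```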

// UpdateStateByKeyOutputIterator.computeNext
gets the scala.collection.Seq<byte[]>, i.e. the seq of encoded values that share the same key
decodes it to scala.collection.Seq<KV<Long, Iterable<WindowedValue<V>>>>
(zero or one items, because the data was already grouped by key)
takes the head of the Seq and pulls out the Iterable<WindowedValue<V>>
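These steps are the second decode. A self-contained sketch, again with plain Java serialization as a hypothetical stand-in for Beam's coders (encodeState mirrors the state encoding so the example runs on its own):

```java
import java.io.*;
import java.util.*;

// Sketch of UpdateStateByKeyOutputIterator.computeNext: the per-key state is a seq
// of byte[] that is decoded back to (timestamp, values) pairs; since the data was
// already grouped per key, the seq holds zero or one item, and the head's values
// are pulled out. Both sides are hypothetical stand-ins for Beam's coders.
public class ComputeNextSketch {

    static byte[] encodeState(long ts, List<String> values) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeLong(ts);
            out.writeObject(new ArrayList<>(values));
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static List<String> headValues(List<byte[]> stateSeq) throws IOException, ClassNotFoundException {
        // 2nd decode: Seq<byte[]> -> Seq<KV<Long, Iterable<V>>>
        List<List<String>> decoded = new ArrayList<>();
        for (byte[] bytes : stateSeq) {
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                in.readLong(); // the timestamp, unused here
                decoded.add((List<String>) in.readObject());
            }
        }
        // zero or one items because we already grouped by key: take the head
        return decoded.isEmpty() ? List.of() : decoded.get(0);
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> seq = List.of(encodeState(1_000L, List.of("a", "b")));
        System.out.println(headValues(seq)); // prints [a, b]
    }
}
```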

> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>                 Key: BEAM-5519
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Winkelman
>            Assignee: Kyle Winkelman
>            Priority: Major
>              Labels: spark, spark-streaming
> When using the SparkRunner in streaming mode, there is a call to groupByKey followed
by a call to updateStateByKey. BEAM-1815 fixed an issue where this used to cause two
shuffles, but it still causes two encode/decode cycles.

This message was sent by Atlassian JIRA
