spark-user mailing list archives

From Jeff Nadler <jnad...@srcginc.com>
Subject Re: Streaming Backpressure with Multiple Streams
Date Wed, 14 Sep 2016 15:14:29 GMT
As you suspected, it only happens with the combination:

Direct Stream only + backpressure = works as expected

4x Receiver on Topic A + Direct Stream on Topic B + backpressure = the
direct stream is throttled even in the absence of scheduling delay

This is using Spark 1.5.0 on CDH.

After the app has been running for several minutes, "Input Metadata" shows
the direct stream consuming 1 record / partition / sec, even though I have
maxrate set at 10,000 records / partition / sec.
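
For anyone trying to reproduce this, here is a rough sketch of the settings
involved, written as a plain Python dict so it runs without a cluster. The
three keys are the standard Spark Streaming properties for backpressure, the
receiver rate limit, and the direct stream per-partition rate limit. The
receiver value and the helper name `to_submit_args` are placeholders for
illustration; only the 10,000 figure comes from this thread.

```python
# Sketch of the Spark Streaming settings discussed in this thread.
# The receiver rate below is a placeholder: the thread does not state it.

DIRECT_MAX_RATE = 10000    # records / partition / sec, as reported above
RECEIVER_MAX_RATE = 5000   # hypothetical value, not given in the thread

streaming_conf = {
    # Turns on the rate controller for both receiver and direct streams.
    "spark.streaming.backpressure.enabled": "true",
    # Upper bound for each receiver (records / sec), set app-wide.
    "spark.streaming.receiver.maxRate": str(RECEIVER_MAX_RATE),
    # Upper bound for the direct Kafka stream (records / partition / sec).
    "spark.streaming.kafka.maxRatePerPartition": str(DIRECT_MAX_RATE),
}


def to_submit_args(conf):
    """Render the settings as spark-submit --conf arguments (illustrative helper)."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", "{}={}".format(key, conf[key])])
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(streaming_conf)))
```

The same key/value pairs could equally be set on the SparkConf in the app;
the point is only that the two maxrate limits are separate properties while
backpressure is a single switch covering both kinds of stream.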

I'll file a bug today unless someone has other ideas.

Thanks!

Jeff


On Fri, Sep 9, 2016 at 5:54 PM, Jeff Nadler <jnadler@srcginc.com> wrote:

> Yes I'll test that next.
>
> On Sep 9, 2016 5:36 PM, "Cody Koeninger" <cody@koeninger.org> wrote:
>
>> Does the same thing happen if you're only using the direct stream plus
>> backpressure, not the receiver stream?
>>
>> On Sep 9, 2016 6:41 PM, "Jeff Nadler" <jnadler@srcginc.com> wrote:
>>
>>> Maybe this is a pretty esoteric implementation, but I'm seeing some bad
>>> behavior with backpressure plus multiple Kafka streams / direct streams.
>>>
>>> Here's the scenario:
>>> We have 1 Kafka topic using the reliable receiver (4 receivers, union
>>> the result). In the same app, we consume another Kafka topic using a
>>> direct stream.
>>>
>>> This may seem strange, but it's necessary in my application to work
>>> around another problem: maxrate is set globally in SparkConf. IMO it
>>> would be more flexible if we could set maxrate for each stream
>>> independently. Since the direct stream uses a different config parameter
>>> for maxrate, we get the desired result.
>>>
>>> A bit hacky I know.
>>>
>>> Anyway, we recently turned on backpressure. It works as expected for
>>> the receiver-based stream. For the direct stream, it starts out at the
>>> maxrate (as expected) on the first batch. Then it ratchets down the
>>> consumption until it is eventually consuming 1 record / second / partition.
>>>
>>> This happens even though there's no scheduling delay, and the
>>> receiver-based stream does not appear to be throttled.
>>>
>>> Anyone ever see anything like this?
>>>
>>> Thanks!
>>>
>>> Jeff Nadler
>>> Aerohive Networks
>>>
>>>
