spark-user mailing list archives

From Richard Startin <richardstar...@outlook.com>
Subject Re: Back-pressure to Spark Kafka Streaming?
Date Mon, 05 Dec 2016 22:58:47 GMT
I've seen the feature work very well. For tuning, you've got:

- spark.streaming.backpressure.pid.proportional (default 1.0, non-negative) - weight for the
  response to the "error" (change between last batch and this batch)
- spark.streaming.backpressure.pid.integral (default 0.2, non-negative) - weight for the
  response to the accumulation of error. This has a dampening effect.
- spark.streaming.backpressure.pid.derived (default 0.0, non-negative) - weight for the
  response to the trend in error. This can cause arbitrary/noise-induced fluctuations in
  batch size, but can also help react quickly to increased/reduced capacity.
- spark.streaming.backpressure.pid.minRate (default 100, must be positive) - batch size
  won't go below this.
- spark.streaming.receiver.maxRate - batch size won't go above this.
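
For reference, a minimal sketch of wiring those up on a SparkConf (the values below are purely
illustrative, not recommendations - the defaults are a sensible starting point):

    import org.apache.spark.SparkConf

    // Illustrative values only; tune per workload.
    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.pid.proportional", "1.0") // response to current error
      .set("spark.streaming.backpressure.pid.integral", "0.2")     // response to accumulated error
      .set("spark.streaming.backpressure.pid.derived", "0.0")      // response to the error trend
      .set("spark.streaming.backpressure.pid.minRate", "100")      // rate floor (records/sec)
      .set("spark.streaming.receiver.maxRate", "10000")            // rate ceiling (receiver-based streams)

Note that with the direct Kafka stream there is no receiver, so the ceiling comes from
spark.streaming.kafka.maxRatePerPartition instead.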


Cheers,

Richard


https://richardstartin.com/


________________________________
From: Liren Ding <sky.gonna.bright@gmail.com>
Sent: 05 December 2016 22:18
To: dev@spark.apache.org; user@spark.apache.org
Subject: Back-pressure to Spark Kafka Streaming?

Hey all,

Does backpressure actually work with Spark Kafka streaming? According to the latest Spark Streaming
documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
"In Spark 1.5, we have introduced a feature called backpressure that eliminates the need to
set this rate limit, as Spark Streaming automatically figures out the rate limits and dynamically
adjusts them if the processing conditions change. This backpressure can be enabled by setting
the configuration parameter spark.streaming.backpressure.enabled to true."
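
As far as I can tell, enabling it should be as simple as setting that flag before creating the
streaming context, roughly like this (sketch only; the app name and interval are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // Sketch: backpressure enabled, multi-minute batch interval similar to our setup.
    val conf = new SparkConf()
      .setAppName("kafka-streaming")   // illustrative app name
      .set("spark.streaming.backpressure.enabled", "true")
    val ssc = new StreamingContext(conf, Minutes(2))
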
But I also see a few open Spark JIRA tickets on this option:
https://issues.apache.org/jira/browse/SPARK-7398
https://issues.apache.org/jira/browse/SPARK-18371

The case in the second ticket describes a similar issue to the one we have here. We use Kafka to send
large batches (10~100M) to Spark Streaming, and the streaming interval is set to 1~4
minutes. With backpressure set to true, the queued active batches still pile up when the average
batch processing time is longer than the batch interval. After the Spark driver is restarted,
all of the queued batches turn into one giant batch, which blocks subsequent batches and also has
a good chance of failing eventually. The only config we found that might help is "spark.streaming.kafka.maxRatePerPartition".
It does limit the incoming batch size, but it is not a perfect solution since it depends on the
number of partitions as well as the length of the batch interval. For our case, hundreds of
partitions times minutes of interval still produce a number that is too large for each batch.
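
To put illustrative numbers on it (not our exact figures): with
spark.streaming.kafka.maxRatePerPartition = 1000 records/sec, 300 partitions, and a 4-minute
(240 s) interval, a single batch can still pull in 1000 x 300 x 240 = 72 million records.
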
So we still want to figure out how to make backpressure work with Spark Kafka streaming, if it is
supposed to work there. Thanks.


Liren







