spark-user mailing list archives

From Nicolae Marasoiu <nicolae.maras...@adswizz.com>
Subject Re: spark.streaming.kafka.maxRatePerPartition for direct stream
Date Fri, 02 Oct 2015 06:01:51 GMT
Hi,


Set 10ms and spark.streaming.backpressure.enabled=true


This should automatically delay the next batch until the current one is processed, or at least,
over a few batches/periods, balance the consume/process rate against the ingestion rate.
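
Roughly, the relevant settings look like this (a minimal sketch; the 10000 cap below is only
an illustration, not a recommendation):

    import org.apache.spark.SparkConf

    // Illustrative values only -- tune them for your cluster.
    val conf = new SparkConf()
      .setAppName("kafka-direct-job")
      // upper bound, in messages per second per Kafka partition, for the first batches
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")
      // let Spark adjust the ingestion rate to the observed processing rate
      .set("spark.streaming.backpressure.enabled", "true")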


Nicu

________________________________
From: Cody Koeninger <cody@koeninger.org>
Sent: Thursday, October 1, 2015 11:46 PM
To: Sourabh Chandak
Cc: user
Subject: Re: spark.streaming.kafka.maxRatePerPartition for direct stream

That depends on your job, your cluster resources, the number of seconds per batch...

You'll need to do some empirical work to figure out how many messages per batch a given executor
can handle.  Divide that by the number of seconds per batch.
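
For example (numbers invented, just to show the arithmetic):

    // Measured empirically on your cluster -- these figures are made up here.
    val messagesPerBatch = 300000L   // messages one executor handles comfortably per batch
    val batchIntervalSec = 30        // your streaming batch interval in seconds
    // maxRatePerPartition is expressed in messages per second, per Kafka partition
    val maxRatePerPartition = messagesPerBatch / batchIntervalSec   // => 10000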



On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak <sourabh3934@gmail.com>
wrote:
Hi,

I am writing a Spark Streaming job using the direct stream method for Kafka, and I want to
handle the case of a checkpoint failure, when we'll have to reprocess the entire data from the
start. By default, for every new checkpoint it tries to load everything from each partition,
and that takes a lot of time to process. After some searching I found that there is a config,
spark.streaming.kafka.maxRatePerPartition, which can be used to tackle this. My question is:
what would be a suitable range for this config if we have ~12 million messages in Kafka with
a maximum message size of ~10 MB?
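
For reference, the stream is set up roughly like this (a sketch; broker addresses, topic name,
and batch interval are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("reprocess-from-kafka"), Seconds(30))
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      "auto.offset.reset" -> "smallest"   // start from the earliest offsets in each partition
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))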

Thanks,
Sourabh

