spark-user mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: spark.streaming.kafka.maxRatePerPartition for direct stream
Date Thu, 01 Oct 2015 20:46:41 GMT
That depends on your job, your cluster resources, the number of seconds per
batch...

You'll need to do some empirical work to figure out how many messages per
batch a given executor can handle.  Divide that by the number of seconds
per batch to get a messages-per-second rate for the setting.
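
As a rough sketch of that arithmetic (the numbers are made up, and the helper
name max_rate_per_partition is hypothetical, not part of Spark): take the
measured messages per batch and divide by the batch interval in seconds.
Dividing further by the number of Kafka partitions covered by the measurement
is my own assumption, since the setting is a per-partition, per-second limit.

```python
def max_rate_per_partition(messages_per_batch, batch_seconds, partitions=1):
    """Estimate a value for spark.streaming.kafka.maxRatePerPartition.

    messages_per_batch: measured number of messages handled within one batch
    batch_seconds:      streaming batch interval, in seconds
    partitions:         Kafka partitions covered by that measurement
                        (assumption: leave at 1 if measured per partition)
    """
    if batch_seconds <= 0 or partitions <= 0:
        raise ValueError("batch_seconds and partitions must be positive")
    # Messages per second per partition, rounded down, never below 1.
    return max(1, int(messages_per_batch // (batch_seconds * partitions)))


# Hypothetical measurement: 60,000 messages fit in a 10 second batch
# spread across 4 Kafka partitions.
rate = max_rate_per_partition(messages_per_batch=60000, batch_seconds=10, partitions=4)
print(f"--conf spark.streaming.kafka.maxRatePerPartition={rate}")
```

Treat the result as a starting point and adjust it after watching actual
batch processing times.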



On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak <sourabh3934@gmail.com>
wrote:

> Hi,
>
> I am writing a Spark Streaming job using the direct stream method for
> Kafka and want to handle the case of checkpoint failure, when we will have
> to reprocess all of the data from the beginning. By default, for every new
> checkpoint it tries to load everything from each partition, and that takes a
> lot of time to process. After some searching, I found that there is a
> config, spark.streaming.kafka.maxRatePerPartition, which can be used
> to tackle this. My question is: what would be a suitable range for this
> config if we have ~12 million messages in Kafka with a maximum message size
> of ~10 MB?
>
> Thanks,
> Sourabh
>
