spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sourabh Chandak <sourabh3...@gmail.com>
Subject Re: spark.streaming.kafka.maxRatePerPartition for direct stream
Date Fri, 02 Oct 2015 06:10:07 GMT
Thanks Cody, will try to do some estimation.

Thanks Nicolae, will try out this config.

Thanks,
Sourabh

On Thu, Oct 1, 2015 at 11:01 PM, Nicolae Marasoiu <
nicolae.marasoiu@adswizz.com> wrote:

> Hi,
>
>
> Set 10ms and spark.streaming.backpressure.enabled=true
>
>
> This should automatically delay the next batch until the current one is
> processed, or at least create that balance over a few batches/periods
> between the consume/process rate vs ingestion rate.
>
>
> Nicu
>
> ------------------------------
> *From:* Cody Koeninger <cody@koeninger.org>
> *Sent:* Thursday, October 1, 2015 11:46 PM
> *To:* Sourabh Chandak
> *Cc:* user
> *Subject:* Re: spark.streaming.kafka.maxRatePerPartition for direct stream
>
> That depends on your job, your cluster resources, the number of seconds
> per batch...
>
> You'll need to do some empirical work to figure out how many messages per
> batch a given executor can handle.  Divide that by the number of seconds
> per batch.
>
>
>
> On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak <sourabh3934@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am writing a spark streaming job using the direct stream method for
>> kafka and wanted to handle the case of checkpoint failure when we'll have
>> to reprocess the entire data from starting. By default for every new
>> checkpoint it tries to load everything from each partition and that takes a
>> lot of time for processing. After some searching found out that there
>> exists a config spark.streaming.kafka.maxRatePerPartition which can be used
>> to tackle this. My question is what will be a suitable range for this
>> config if we have ~12 million messages in kafka with maximum message size
>> ~10 MB.
>>
>> Thanks,
>> Sourabh
>>
>
>

Mime
View raw message