spark-user mailing list archives

From Shushant Arora <shushantaror...@gmail.com>
Subject Re: spark streaming 1.3 with kafka
Date Tue, 01 Sep 2015 17:15:27 GMT
I feel the need for pause and resume in a streaming app :)

Is there any limit on the max number of queued jobs? If yes, what happens once
that limit is reached? Does the job get killed?


On Tue, Sep 1, 2015 at 10:02 PM, Cody Koeninger <cody@koeninger.org> wrote:

> Sounds like you'd be better off just failing if the external server is
> down, and scripting monitoring / restarting of your job.
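>
> e.g. something along these lines inside foreachRDD - a rough, untested
> sketch (imports omitted); checkServerReachable() is just a placeholder for
> whatever health check fits your setup. If the check fails, the batch throws,
> the app dies, and an external supervisor (cron, monit, etc.) restarts it:
>
>   directKafkaStream.foreachRDD(new Function<JavaRDD<byte[][]>, Void>() {
>     @Override
>     public Void call(JavaRDD<byte[][]> rdd) throws Exception {
>       // Fail fast on the driver if the downstream server is unreachable.
>       if (!checkServerReachable()) {
>         throw new RuntimeException("External server down - failing the job");
>       }
>       // ... post events to the external server here (foreachPartition) ...
>       return null;
>     }
>   });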
>
> On Tue, Sep 1, 2015 at 11:19 AM, Shushant Arora <shushantarora09@gmail.com
> > wrote:
>
>> In my app, after processing the events I post them to an external
>> server. If the external server is down, I want to back off consuming
>> from kafka, but I can't stop and restart the consumer since that needs
>> manual effort.
>>
>> Backing off a few batches is also not possible, since the decision to
>> back off is based on the last batch's processing status, but the driver
>> has already computed offsets for the next batches. So if I skip the next
>> few batches until the external server is back to normal, it's data loss
>> unless I can reset the offsets.
>>
>> So the only option seems to be delaying the current batch by calling
>> sleep() in the foreachRDD method after returning from the foreachPartition
>> transformations.
>>
>> The concern here is that further batches will keep enqueueing until the
>> current slept batch completes. So what is the max size of the scheduling
>> queue? Say the server does not come up for hours and my batch interval is
>> 5 sec - it will enqueue 720 batches an hour.
>> Will that be an issue?
>> And is there any setting in the direct kafka stream to enforce that no
>> further compute() calls are made after the scheduling queue reaches a
>> threshold, say 50 batches? Once the queue size comes back below the
>> threshold, compute would be called and the next job enqueued.
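>>
>> To make the sleep idea concrete, this is roughly what I had in mind - just
>> a sketch, not tested; isExternalServerUp() is a placeholder check:
>>
>>   directKafkaStream.foreachRDD(new Function<JavaRDD<byte[][]>, Void>() {
>>     @Override
>>     public Void call(JavaRDD<byte[][]> rdd) throws Exception {
>>       rdd.foreachPartition(new VoidFunction<Iterator<byte[][]>>() {
>>         @Override
>>         public void call(Iterator<byte[][]> it) throws Exception {
>>           // post events to the external server here
>>         }
>>       });
>>       // After processing, block this batch on the driver until the
>>       // external server is reachable again; later batches queue behind it.
>>       while (!isExternalServerUp()) {
>>         Thread.sleep(30000); // back off 30s between checks
>>       }
>>       return null;
>>     }
>>   });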
>>
>>
>>
>>
>>
>> On Tue, Sep 1, 2015 at 8:57 PM, Cody Koeninger <cody@koeninger.org>
>> wrote:
>>
>>> Honestly I'd concentrate more on getting your batches to finish in a
>>> timely fashion, so you won't even have the issue to begin with...
>>>
>>> On Tue, Sep 1, 2015 at 10:16 AM, Shushant Arora <
>>> shushantarora09@gmail.com> wrote:
>>>
>>>> What if I use custom checkpointing, so that I can take care of offsets
>>>> being checkpointed at the end of each batch?
>>>>
>>>> Would it be possible then to reset the offsets?
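>>>>
>>>> For the custom checkpointing part I was thinking of something like the
>>>> pattern below (sketch only; saveOffsetsToStore() is a placeholder for
>>>> writing the ranges to zookeeper / a db):
>>>>
>>>>   directKafkaStream.foreachRDD(new Function<JavaRDD<byte[][]>, Void>() {
>>>>     @Override
>>>>     public Void call(JavaRDD<byte[][]> rdd) throws Exception {
>>>>       // RDDs of the direct stream carry their kafka offset ranges.
>>>>       OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
>>>>       // ... process the batch ...
>>>>       // Persist the offsets only after the batch has succeeded.
>>>>       saveOffsetsToStore(offsets);
>>>>       return null;
>>>>     }
>>>>   });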
>>>>
>>>> On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger <cody@koeninger.org>
>>>> wrote:
>>>>
>>>>> No, if you start arbitrarily messing around with offset ranges after
>>>>> compute is called, things are going to get out of whack.
>>>>>
>>>>> e.g. checkpoints are no longer going to correspond to what you're
>>>>> actually processing
>>>>>
>>>>> On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora <
>>>>> shushantarora09@gmail.com> wrote:
>>>>>
>>>>>> Can I reset the range based on some condition, before calling
>>>>>> transformations on the stream?
>>>>>>
>>>>>> Say, before calling:
>>>>>> directKafkaStream.foreachRDD(new Function<JavaRDD<byte[][]>, Void>() {
>>>>>>   @Override
>>>>>>   public Void call(JavaRDD<byte[][]> v1) throws Exception {
>>>>>>     v1.foreachPartition(new VoidFunction<Iterator<byte[][]>>() {
>>>>>>       @Override
>>>>>>       public void call(Iterator<byte[][]> t) throws Exception {
>>>>>>       }
>>>>>>     });
>>>>>>     return null;
>>>>>>   }
>>>>>> });
>>>>>>
>>>>>> change the directKafkaStream RDD's offset range (fromOffset).
>>>>>>
>>>>>> I can't do this in the compute method, since compute would have been
>>>>>> called at the current batch's queue time, but the condition is set at
>>>>>> the previous batch's run time.
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger <cody@koeninger.org>
>>>>>> wrote:
>>>>>>
>>>>>>> It's at the time compute() gets called, which should be near the
>>>>>>> time the batch should have been queued.
>>>>>>>
>>>>>>> On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora <
>>>>>>> shushantarora09@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> In spark streaming 1.3 with kafka, when does the driver bring the
>>>>>>>> latest offsets for a run - at the start of each batch, or at the
>>>>>>>> time the batch gets queued?
>>>>>>>>
>>>>>>>> Say a few of my batches take longer to complete than their batch
>>>>>>>> interval, so some batches will go into the queue. Does the driver
>>>>>>>> wait for queued batches to start before fetching the latest offsets,
>>>>>>>> or does it fetch them before the batches have even started, so that
>>>>>>>> when they run they work on the old offsets brought at the time they
>>>>>>>> were queued?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
