spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cody Koeninger <c...@koeninger.org>
Subject Re: spark streaming 1.3 with kafka
Date Tue, 01 Sep 2015 15:12:25 GMT
No, if you start arbitrarily messing around with offset ranges after
compute is called, things are going to get out of whack.

e.g. checkpoints are no longer going to correspond to what you're actually
processing

On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora <shushantarora09@gmail.com>
wrote:

> can I reset the range based on some condition - before calling
> transformations on the stream.
>
> Say -
> before calling :
>  directKafkaStream.foreachRDD(new Function<JavaRDD<byte[][]>, Void>() {
>
> @Override
> public Void call(JavaRDD<byte[][]> v1) throws Exception {
> v1.foreachPartition(new  VoidFunction<Iterator<byte[][]>>{
> @Override
> public void call(Iterator<byte[][]> t) throws Exception {
> }});}});
>
> change directKafkaStream's RDD's offset range.(fromOffset).
>
> I can't do this in compute method since compute would have been called at
> current batch queue time - but condition is set at previous batch run time.
>
>
> On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger <cody@koeninger.org> wrote:
>
>> It's at the time compute() gets called, which should be near the time the
>> batch should have been queued.
>>
>> On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora <shushantarora09@gmail.com
>> > wrote:
>>
>>> Hi
>>>
>>> In spark streaming 1.3 with kafka- when does driver bring latest offsets
>>> of this run - at start of each batch or at time when  batch gets queued ?
>>>
>>> Say few of my batches take longer time to complete than their batch
>>> interval. So some of batches will go in queue. Will driver waits for
>>>  queued batches to get started or just brings the latest offsets before
>>> they even actually started. And when they start running they will work on
>>> old offsets brought at time when they were queued.
>>>
>>>
>>
>

Mime
View raw message