spark-user mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: How to force Spark Kafka Direct to start from the latest offset when the lag is huge in kafka 10?
Date Tue, 22 Aug 2017 14:20:54 GMT
Kafka RDDs need to start from a specified offset; you really don't
want the executors just starting at whatever offset happened to be
latest at the time they ran.

If you need a way to figure out the latest offsets at the time the
driver starts up, you can always use a consumer to read the offsets
and then pass them to Assign (just make sure that consumer is closed
before the job starts so you don't get group id conflicts).  You can
even write your own implementation of ConsumerStrategy, which should
let you do pretty much whatever you need to get the consumer into
the state you want.
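
A minimal sketch of the Assign approach described above, assuming the
spark-streaming-kafka-0-10 artifact and its Kafka client are on the
classpath. The broker address, topic name, group id, and the helper
name latestOffsets are placeholders made up for illustration, not
anything from the original thread. The helper opens a short-lived
consumer on the driver, seeks to the end of every partition, records
the positions, and closes the consumer before the stream is created.

```scala
import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object StartFromLatest {

  // Hypothetical helper: reads the current end offset of every partition
  // of `topic` with a temporary consumer, then closes that consumer.
  def latestOffsets(kafkaParams: Map[String, Object],
                    topic: String): Map[TopicPartition, Long] = {
    val consumer = new KafkaConsumer[String, String](kafkaParams.asJava)
    try {
      val partitions = consumer.partitionsFor(topic).asScala
        .map(p => new TopicPartition(p.topic, p.partition))
        .toList
      consumer.assign(partitions.asJava)
      // seekToEnd is lazy; the position() calls below force the lookup.
      consumer.seekToEnd(partitions.asJava)
      partitions.map(tp => tp -> consumer.position(tp)).toMap
    } finally {
      // Close before the streaming job starts, as advised above.
      consumer.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed settings; replace with your own cluster and group id.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-existing-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topic = "events" // hypothetical topic name

    val fromOffsets = latestOffsets(kafkaParams, topic)

    val ssc = new StreamingContext(
      new SparkConf().setAppName("start-from-latest"), Seconds(5))

    // Assign pins the stream to explicit partitions and starting offsets,
    // so offsets previously stored for the group id are not consulted.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Assign[String, String](
        fromOffsets.keys, kafkaParams, fromOffsets)
    )

    stream.map(_.value).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the offsets are captured once on the driver, every executor
starts from the same fixed positions, which is the property the reply
above asks for. If the job restarts from a checkpoint, the
checkpointed offsets would normally take over, so treat this as a
starting point for fresh starts only.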

On Mon, Aug 21, 2017 at 6:57 PM, swetha kasireddy
<swethakasireddy@gmail.com> wrote:
> Hi Cody,
>
> I think Assign is used if we want it to start from a specified offset.
> What if we want it to start from the latest offset, like the one
> returned by "auto.offset.reset" -> "latest"?
>
>
> Thanks!
>
> On Mon, Aug 21, 2017 at 9:06 AM, Cody Koeninger <cody@koeninger.org> wrote:
>>
>> Yes, you can start from specified offsets.  See ConsumerStrategy,
>> specifically Assign
>>
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#your-own-data-store
>>
>> On Tue, Aug 15, 2017 at 1:18 PM, SRK <swethakasireddy@gmail.com> wrote:
>> > Hi,
>> >
>> > How to force Spark Kafka Direct to start from the latest offset when
>> > the lag is huge in kafka 10? It seems to be processing from the latest
>> > offset stored for a group id. One way to do this is to change the group
>> > id. But it would mean that each time that we need to process the job
>> > from the latest offset we have to provide a new group id.
>> >
>> > Is there a way to force the job to run from the latest offset in case we
>> > need to and still use the same group id?
>> >
>> > Thanks!
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >
>
>
