spark-issues mailing list archives

From "Cody Koeninger (JIRA)" <>
Subject [jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
Date Fri, 14 Oct 2016 01:24:21 GMT


Cody Koeninger commented on SPARK-17812:

OK, failing on start is clear (it's really annoying in the case of subscribePattern), but
at least it's clear.  I think that's enough to get started on this ticket.  Is anyone currently
working on it, or can I do it?  Ryan seemed worried that it wouldn't get done in time for the
next release.

It sounds like your current plan is to ignore whatever comes out of KAFKA-3370, which is fine
as long as whatever you do is both clear and equally tunable.  Clarity of semantics can't
be the only criterion of an API.  "You can only start at latest offset, period" is clear, but
it's a crap API.

> the only case where we lack sufficient tunability is "Where do I go when the current offsets
> are invalid due to retention?".

No, you lack sufficient tunability as to where newly discovered partitions start.  Keep in
mind that those partitions may have been discovered after a significant job downtime.  If
the point of an API is to provide clear semantics to the user, it is not at all clear to me
as a user how I can start those partitions at latest, which I know is possible in the underlying
data model.

The reason I'm belaboring this point now is that you have chosen names (earliest, latest)
for the API currently under discussion that are confusingly similar to the existing auto offset
reset functionality, and you have provided knobs for some, but not all, of the things auto
offset reset currently affects.  This is going to confuse people; it already confuses me.
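
To make the distinction concrete, here is a minimal sketch of the three separate decisions at
issue: where a query starts, where a newly discovered partition starts, and where to go when a
stored offset is no longer valid due to retention.  This is not Spark or Kafka code; the names
`OffsetPolicy` and `resolve_offset` and the three knob names are hypothetical, chosen only to
show that each decision can be tuned independently.

```python
from dataclasses import dataclass
from typing import Optional

EARLIEST = "earliest"
LATEST = "latest"


@dataclass(frozen=True)
class OffsetPolicy:
    """Hypothetical knobs, one per decision. Not an actual Spark or Kafka API."""
    on_start: str = LATEST            # partition known at query start, no stored offset
    on_new_partition: str = EARLIEST  # partition discovered after the query started
    on_out_of_range: str = EARLIEST   # stored offset fell outside the retained range


def resolve_offset(policy: OffsetPolicy,
                   stored: Optional[int],
                   earliest: int,
                   latest: int,
                   newly_discovered: bool) -> int:
    """Pick the offset to read from for a single partition."""

    def pick(choice: str) -> int:
        if choice == EARLIEST:
            return earliest
        if choice == LATEST:
            return latest
        raise ValueError(f"unknown choice: {choice!r}")

    if stored is None:
        # No stored position: the answer depends on *why* there is none.
        return pick(policy.on_new_partition if newly_discovered else policy.on_start)
    if stored < earliest or stored > latest:
        # Stored position is invalid, e.g. the data was deleted by retention.
        return pick(policy.on_out_of_range)
    return stored


# A user who wants new partitions to start at latest, independent of the other knobs.
policy = OffsetPolicy(on_start=EARLIEST, on_new_partition=LATEST, on_out_of_range=EARLIEST)
new_partition_offset = resolve_offset(policy, None, 100, 500, newly_discovered=True)
expired_offset = resolve_offset(policy, 42, 100, 500, newly_discovered=False)
```

With a single shared earliest/latest setting, the last two calls could not be given different
answers; with separate knobs they can.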

> More granular control of starting offsets (assign)
> --------------------------------------------------
>                 Key: SPARK-17812
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
> Right now you can only run a Streaming Query starting from either the earliest or latest
> offsets available at the moment the query is started.  Sometimes this is a lot of data.  It
> would be nice to be able to do the following:
>  - seek to user specified offsets for manually specified topicpartitions
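
As an illustration of what "user specified offsets for manually specified topicpartitions"
could look like, the sketch below builds a per-partition offset specification as JSON.  The
shape follows the one Spark's Kafka source later documented for its `startingOffsets` option,
where -2 stands for earliest and -1 for latest; the helper name `starting_offsets_json` is
hypothetical and is not part of Spark.

```python
import json
from typing import Dict, Tuple

EARLIEST = -2  # sentinel meaning "earliest available offset"
LATEST = -1    # sentinel meaning "latest available offset"


def starting_offsets_json(offsets: Dict[Tuple[str, int], int]) -> str:
    """Turn {(topic, partition): offset} into nested JSON keyed by topic, then partition."""
    nested: Dict[str, Dict[str, int]] = {}
    for (topic, partition), offset in sorted(offsets.items()):
        if offset < EARLIEST:
            raise ValueError(f"invalid offset {offset} for {topic}-{partition}")
        # JSON object keys must be strings, so the partition number is stringified.
        nested.setdefault(topic, {})[str(partition)] = offset
    return json.dumps(nested, separators=(",", ":"))


spec = starting_offsets_json({
    ("topicA", 0): 23,        # start at a concrete offset
    ("topicA", 1): LATEST,    # start at latest
    ("topicB", 0): EARLIEST,  # start at earliest
})
```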
