spark-issues mailing list archives

From "Cody Koeninger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
Date Thu, 13 Oct 2016 19:29:21 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572922#comment-15572922 ]

Cody Koeninger commented on SPARK-17812:
----------------------------------------

Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're dealing with a
stream that has multiple topics in it.  I agree we should break that out and focus on defining
starting offsets in this ticket.
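
A rough illustration of why a fixed offset count is a poor proxy for time: the sketch below
assumes two topics with very different, made-up message rates (the rates and the helper name
seconds_covered are hypothetical, not measurements or part of any API) and shows how far back
in time the same number of offsets reaches on each.

```python
# Hypothetical per-partition message rates (messages per second); made-up numbers.
rates_per_sec = {"topicfoo": 1000.0, "topicbar": 1.0}


def seconds_covered(offsets_back: int, rate_per_sec: float) -> float:
    """Approximate time span covered by rewinding `offsets_back` messages
    on a partition that receives `rate_per_sec` messages per second."""
    return offsets_back / rate_per_sec


offsets_back = 10_000
spans = {
    topic: seconds_covered(offsets_back, rate)
    for topic, rate in rates_per_sec.items()
}

for topic, secs in spans.items():
    print(f"{topic}: {offsets_back} offsets back is roughly {secs:.0f} s of data")
```

With these assumed rates, the same 10,000 offsets cover about ten seconds on one topic and
close to three hours on the other.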

The concern with startingOffsets naming is that, because auto.offset.reset is orthogonal to
specifying some offsets, you have a situation like this:

.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }}""")

where startingOffsetForRealzYo would be given a more reasonable name, one that conveys that it
specifies starting offsets yet is not confusingly similar to startingOffset.
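
For reference, the per-topic JSON in the option value above could be flattened into a
(topic, partition) -> offset mapping along these lines. This is only a sketch using the
standard library; parse_starting_offsets is a hypothetical helper name, not anything in Spark.

```python
import json
from typing import Dict, Tuple


def parse_starting_offsets(spec: str) -> Dict[Tuple[str, int], int]:
    """Flatten {"topic": {"partition": offset}} JSON into {(topic, partition): offset}."""
    parsed = json.loads(spec)
    offsets: Dict[Tuple[str, int], int] = {}
    for topic, partitions in parsed.items():
        for partition, offset in partitions.items():
            # JSON object keys are always strings, so partition ids need converting.
            offsets[(topic, int(partition))] = int(offset)
    return offsets


spec = """ { "topicfoo" : { "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }} """
offsets = parse_starting_offsets(spec)
print(offsets[("topicfoo", 1)])  # 4567
```

Keeping this JSON restricted to explicit topic/partition entries is what lets it stay
independent of the "latest"/"earliest" style setting.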

Trying to hack it all into one json as an alternative, with a "default" topic, means you're
going to have to pick a key that isn't a valid topic, or add yet another layer of indirection.
It also makes it harder to make the format consistent with SPARK-17829 (which seems like
a good thing to keep consistent, I agree).
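
To show the key-collision problem concretely, here is a small sketch. It assumes the usual
Kafka topic-name rule (ASCII letters, digits, '.', '_' and '-', at most 249 characters, and
not "." or ".."); the combined JSON layout and the is_valid_topic helper are hypothetical,
made up purely for illustration.

```python
import json
import re

# Assumption: the usual Kafka topic-name rule described in the lead-in.
LEGAL_TOPIC = re.compile(r"[a-zA-Z0-9._-]{1,249}")


def is_valid_topic(name: str) -> bool:
    """Return True if `name` could be a real topic under the assumed rule."""
    return name not in (".", "..") and LEGAL_TOPIC.fullmatch(name) is not None


# Hypothetical single-JSON format with a "default" entry for unlisted topics.
combined = json.loads('{"default": "latest", "topicfoo": {"0": 1234, "1": 4567}}')

# Keys meant as special markers that could also be the name of a real topic.
ambiguous_keys = [
    key
    for key, value in combined.items()
    if not isinstance(value, dict) and is_valid_topic(key)
]
print(ambiguous_keys)               # ['default']

# A reserved key would need a character that topic names cannot contain.
print(is_valid_topic("*default*"))  # False
```

Under that assumption, "default" is itself a legal topic name, so a combined format would
need either an illegal-looking key or an extra wrapper level to stay unambiguous.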

Obviously I think you should just change the name, but it's your show.





> More granular control of starting offsets (assign)
> --------------------------------------------------
>
>                 Key: SPARK-17812
>                 URL: https://issues.apache.org/jira/browse/SPARK-17812
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the earliest or latest
> offsets available at the moment the query is started.  Sometimes this is a lot of data.  It
> would be nice to be able to do the following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


