spark-issues mailing list archives

From "Ofir Manor (JIRA)" <>
Subject [jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
Date Thu, 13 Oct 2016 22:06:20 GMT


Ofir Manor commented on SPARK-17812:

Thanks Cody! Great to have a concrete example.
I have some comments, but it's mostly bikeshedding:
1. subscribe vs. subscribePattern --> personally, I would combine them both into "subscribe"
- no need to burden the user with the nuances of the different Kafka APIs. It can take a list of discrete
topics or a pattern.
2. It would be much clearer if "assign" was called subscribeSomething, so the user would choose
one "subscribe..." and one (or more) "starting...".
Not sure I have a good name though - subscribeCustom?
You could even use the regular subscribe for that (and be smarter with the pattern matching)
- I think it would just work, and if someone tries to be funny (combining an asterisk and partitions)
we could just raise an error.
3. I like startingTime... pretty neat.
We could hypothetically add {{.option("startingMessages", long)}} to support Michael's "just
start with the 1000 most recent messages"...
4. As I said before, I'd rather have all starting* options be mutually exclusive. Yes, it blocks some
edge cases, on purpose, but it makes the API and code way clearer (think about startingMessages
interacting with startingOffsets etc.).
I think it would be easier to change our minds and allow multiple starting* options in the future (opening up
all sorts of esoteric combinations) than to clean it up later if users find it confusing
and unnecessary.
Anyway, as long as it is functional I'm good with it, even if it is less aesthetic.
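
To make points 2 and 4 concrete, here is a rough Python sketch of the two checks I have in mind. It is only an illustration, not Spark code: the helper names (classify_subscribe, check_starting) and the "topic:partition" syntax for naming a partition are assumptions made up for this example, and the option names are the ones floated in this thread.

```python
import re

# Option names as discussed in this thread; hypothetical, not a final API.
STARTING_OPTIONS = ("startingOffsets", "startingTime", "startingMessages")


def classify_subscribe(value):
    """Classify one combined 'subscribe' value.

    Returns 'topics', 'pattern', or 'partitions'. The 'topic:partition'
    entry syntax is an assumption made for this sketch.
    """
    entries = [e.strip() for e in value.split(",") if e.strip()]
    if not entries:
        raise ValueError("subscribe must name at least one topic")
    has_wildcard = any("*" in e for e in entries)
    has_partition = any(re.fullmatch(r".+:\d+", e) for e in entries)
    if has_wildcard and has_partition:
        # "If someone tries to be funny ... we could just raise an error."
        raise ValueError("cannot combine a wildcard with explicit partitions")
    if has_partition:
        return "partitions"
    return "pattern" if has_wildcard else "topics"


def check_starting(options):
    """Return the single starting* option that is set, or None.

    Raises ValueError if more than one starting* option is present.
    """
    present = [name for name in STARTING_OPTIONS if name in options]
    if len(present) > 1:
        raise ValueError("mutually exclusive options: " + ", ".join(present))
    return present[0] if present else None
```

With something like this, {{classify_subscribe("topicA,topicB")}} gives "topics", a value containing an asterisk gives "pattern", and passing both startingOffsets and startingMessages fails fast instead of leaving their interaction undefined.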

> More granular control of starting offsets (assign)
> --------------------------------------------------
>                 Key: SPARK-17812
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
> Right now you can only run a Streaming Query starting from either the earliest or latest
> offsets available at the moment the query is started.  Sometimes this is a lot of data.  It
> would be nice to be able to do the following:
>  - seek to user specified offsets for manually specified topicpartitions
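
As a rough illustration of what "user specified offsets for manually specified topicpartitions" could look like, the Python sketch below parses a JSON string mapping topic to partition to offset. The JSON layout and the helper name parse_starting_offsets are assumptions for this example only; nothing in this thread fixes the format.

```python
import json


def parse_starting_offsets(spec):
    """Parse a JSON spec such as '{"topicA": {"0": 23}}'.

    Returns a dict keyed by (topic, partition) with integer offsets.
    The layout is hypothetical and chosen only for illustration.
    """
    parsed = json.loads(spec)
    offsets = {}
    for topic, partitions in parsed.items():
        for partition, offset in partitions.items():
            # JSON object keys are always strings, so convert the partition id.
            offsets[(topic, int(partition))] = int(offset)
    return offsets
```

A reader of such a spec would then seek each listed topic-partition to its offset before the query starts consuming.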
