spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Saisai Shao <sai.sai.s...@gmail.com>
Subject Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?
Date Fri, 12 Jun 2015 05:54:07 GMT
OK, I get it, I think currently Python based Kafka direct API do not
provide such equivalence like Scala, maybe we should figure out to add this
into Python API also.

2015-06-12 13:48 GMT+08:00 Amit Ramesh <amit@yelp.com>:

>
> Hi Jerry,
>
> Take a look at this example:
> https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2
>
> The offsets are needed because as RDDs get generated within spark the
> offsets move further along. With direct Kafka mode the current offsets are
> no more persisted in Zookeeper but rather within Spark itself. If you want
> to be able to use zookeeper based monitoring tools to keep track of
> progress, then this is needed.
>
> In my specific case we need to persist Kafka offsets externally so that we
> can continue from where we left off after a code deployment. In other
> words, we need exactly-once processing guarantees across code deployments.
> Spark does not support any state persistence across deployments so this is
> something we need to handle on our own.
>
> Hope that helps. Let me know if not.
>
> Thanks!
> Amit
>
>
> On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao <sai.sai.shao@gmail.com>
> wrote:
>
>> Hi,
>>
>> What is your meaning of getting the offsets from the RDD, from my
>> understanding, the offsetRange is a parameter you offered to KafkaRDD, why
>> do you still want to get the one previous you set into?
>>
>> Thanks
>> Jerry
>>
>> 2015-06-12 12:36 GMT+08:00 Amit Ramesh <amit@yelp.com>:
>>
>>>
>>> Congratulations on the release of 1.4!
>>>
>>> I have been trying out the direct Kafka support in python but haven't
>>> been able to figure out how to get the offsets from the RDD. Looks like the
>>> documentation is yet to be updated to include Python examples (
>>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
>>> I am specifically looking for the equivalent of
>>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
>>> I tried digging through the python code but could not find anything
>>> related. Any pointers would be greatly appreciated.
>>>
>>> Thanks!
>>> Amit
>>>
>>>
>>
>

Mime
View raw message