spark-dev mailing list archives

From Saisai Shao <sai.sai.s...@gmail.com>
Subject Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?
Date Fri, 12 Jun 2015 06:05:23 GMT
Scala KafkaRDD uses a trait to handle this, but that approach is not so easy or
straightforward in Python, where we would need a specific API for it. I'm not
sure there is a simple workaround; maybe we should think carefully about it.
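A minimal sketch of the Scala pattern being referred to, assuming a stream
created with KafkaUtils.createDirectStream (called directKafkaStream below),
along the lines of the example in the integration guide linked later in the
thread:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // Each RDD produced by the direct stream also implements HasOffsetRanges,
    // so the driver can recover the offsets with a cast. The cast has to
    // happen on the first RDD in the chain, before any transformation changes
    // the RDD type.
    var offsetRanges = Array[OffsetRange]()

    directKafkaStream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      for (o <- offsetRanges) {
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }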

2015-06-12 13:59 GMT+08:00 Amit Ramesh <amit@yelp.com>:

>
> Thanks, Jerry. That's what I suspected based on the code I looked at. Any
> pointers on what is needed to build in this support would be great. This is
> critical to the project we are currently working on.
>
> Thanks!
>
>
> On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao <sai.sai.shao@gmail.com>
> wrote:
>
>> OK, I get it. I think the Python Kafka direct API currently does not
>> provide an equivalent of the Scala one; maybe we should figure out how to
>> add this to the Python API as well.
>>
>> 2015-06-12 13:48 GMT+08:00 Amit Ramesh <amit@yelp.com>:
>>
>>>
>>> Hi Jerry,
>>>
>>> Take a look at this example:
>>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2
>>>
>>> The offsets are needed because, as RDDs get generated within Spark, the
>>> offsets move further along. In direct Kafka mode the current offsets are
>>> no longer persisted in ZooKeeper but rather within Spark itself. If you
>>> want to use ZooKeeper-based monitoring tools to keep track of progress,
>>> then this is needed.
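One way to keep those monitoring tools working is to write the processed
offsets back to ZooKeeper yourself. A sketch, assuming Apache Curator is on the
classpath, that stream is a direct stream, and that zkQuorum and groupId are
placeholders for your own values; the path layout is the conventional
/consumers/<group>/offsets/<topic>/<partition> that ZooKeeper-based monitors
read:

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.retry.ExponentialBackoffRetry
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    // Runs on the driver; the offset ranges are available there in foreachRDD.
    val zk = CuratorFrameworkFactory.newClient("zkQuorum:2181",
      new ExponentialBackoffRetry(1000, 3))
    zk.start()

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      for (o <- ranges) {
        val path = s"/consumers/groupId/offsets/${o.topic}/${o.partition}"
        val data = o.untilOffset.toString.getBytes("UTF-8")
        if (zk.checkExists().forPath(path) == null)
          zk.create().creatingParentsIfNeeded().forPath(path, data)
        else
          zk.setData().forPath(path, data)
      }
    }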
>>>
>>> In my specific case we need to persist Kafka offsets externally so that
>>> we can continue from where we left off after a code deployment. In other
>>> words, we need exactly-once processing guarantees across code deployments.
>>> Spark does not support any state persistence across deployments so this is
>>> something we need to handle on our own.
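A sketch of that pattern, resuming from externally stored offsets after a
redeploy. loadOffsets and saveOffsets are hypothetical helpers backed by
whatever external store you choose (they are not part of Spark), and ssc and
kafkaParams are assumed to be set up already:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    // Hypothetical helpers backed by an external store (database, file, ...).
    def loadOffsets(): Map[TopicAndPartition, Long] = ???
    def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

    // Start the direct stream exactly where the previous deployment stopped.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, loadOffsets(),
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch idempotently or transactionally ...
      saveOffsets(ranges)  // record progress only after processing succeeds
    }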
>>>
>>> Hope that helps. Let me know if not.
>>>
>>> Thanks!
>>> Amit
>>>
>>>
>>> On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao <sai.sai.shao@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> What do you mean by getting the offsets from the RDD? From my
>>>> understanding, the offsetRange is a parameter you pass to KafkaRDD, so
>>>> why do you still want to get back the one you set previously?
>>>>
>>>> Thanks
>>>> Jerry
>>>>
>>>> 2015-06-12 12:36 GMT+08:00 Amit Ramesh <amit@yelp.com>:
>>>>
>>>>>
>>>>> Congratulations on the release of 1.4!
>>>>>
>>>>> I have been trying out the direct Kafka support in Python but haven't
>>>>> been able to figure out how to get the offsets from the RDD. It looks
>>>>> like the documentation is yet to be updated to include Python examples (
>>>>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
>>>>> I am specifically looking for the equivalent of
>>>>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
>>>>> I tried digging through the python code but could not find anything
>>>>> related. Any pointers would be greatly appreciated.
>>>>>
>>>>> Thanks!
>>>>> Amit
>>>>>
>>>>>
>>>>
>>>
>>
>
