spark-user mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: Rebalancing when adding kafka partitions
Date Wed, 17 Aug 2016 01:12:58 GMT
The underlying Kafka consumer.
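For anyone reading this in the archive: the periodic pattern matching described in the SubscribePattern comment quoted below can be illustrated with plain Scala (the topic names and pattern here are made up; the real check runs inside the consumer against live cluster metadata):

```scala
import java.util.regex.Pattern

object SubscribePatternDemo {
  // Mimics one periodic check: keep only the topics that match the pattern.
  def matchingTopics(pattern: Pattern, existing: Seq[String]): Seq[String] =
    existing.filter(t => pattern.matcher(t).matches())

  def main(args: Array[String]): Unit = {
    val pattern = Pattern.compile("events-.*")

    // Topics that exist at the time of the first check.
    val before = Seq("events-us", "metrics")
    // A topic created later is picked up on the next check.
    val after = before :+ "events-eu"

    println(matchingTopics(pattern, before)) // List(events-us)
    println(matchingTopics(pattern, after))  // List(events-us, events-eu)
  }
}
```

The same "match against whatever exists right now" logic is why SubscribePattern sees new partitions/topics only as of its next check, not instantly.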

On Tue, Aug 16, 2016 at 2:17 PM, Srikanth <srikanth.ht@gmail.com> wrote:
> Yes, SubscribePattern detects new partition. Also, it has a comment saying
>
>> Subscribe to all topics matching specified pattern to get dynamically
>> assigned partitions.
>>  * The pattern matching will be done periodically against topics existing
>> at the time of check.
>>  * @param pattern pattern to subscribe to
>>  * @param kafkaParams Kafka
>
>
> Who discovers the new partitions? The underlying Kafka consumer or
> spark-streaming-kafka-0-10-assembly?
>
> Srikanth
>
> On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger <cody@koeninger.org> wrote:
>>
>> Hrrm, that's interesting. Did you try with subscribe pattern, out of
>> curiosity?
>>
>> I haven't tested repartitioning on the underlying new Kafka consumer, so
>> it's possible I misunderstood something.
>>
>> On Aug 12, 2016 2:47 PM, "Srikanth" <srikanth.ht@gmail.com> wrote:
>>>
>>> I did try a test with spark 2.0 + spark-streaming-kafka-0-10-assembly.
>>> Partitions were increased using "bin/kafka-topics.sh --alter" after the
>>> Spark job was started.
>>> I don't see messages from new partitions in the DStream.
>>>
>>>>     KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
>>>>         ssc, PreferConsistent,
>>>>         Subscribe[Array[Byte], Array[Byte]](topics, kafkaParams))
>>>>       .map(r => (r.key(), r.value()))
>>>
>>>
>>> Also, the number of RDD partitions did not increase either.
>>>>
>>>>     dataStream.foreachRDD { (rdd, curTime) =>
>>>>       logger.info(s"rdd has ${rdd.getNumPartitions} partitions.")
>>>>     }
>>>
>>>
>>> Should I be setting some parameter/config? Is the doc for the new
>>> integration available?
>>>
>>> Thanks,
>>> Srikanth
>>>
>>> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger <cody@koeninger.org>
>>> wrote:
>>>>
>>>> No, restarting from a checkpoint won't do it; you need to re-define the
>>>> stream.
>>>>
>>>> Here's the jira for the 0.10 integration
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-12177
>>>>
>>>> I haven't gotten docs completed yet, but there are examples at
>>>>
>>>> https://github.com/koeninger/kafka-exactly-once/tree/kafka-0.10
>>>>
>>>> On Fri, Jul 22, 2016 at 1:05 PM, Srikanth <srikanth.ht@gmail.com> wrote:
>>>> > In Spark 1.x, if we restart from a checkpoint, will it read from new
>>>> > partitions?
>>>> >
>>>> > If you can, please point us to a doc/link that covers the Kafka 0.10
>>>> > integration in Spark 2.0.
>>>> >
>>>> > On Fri, Jul 22, 2016 at 1:33 PM, Cody Koeninger <cody@koeninger.org>
>>>> > wrote:
>>>> >>
>>>> >> For the Kafka 0.8 integration, you are literally starting a
>>>> >> streaming job against a fixed set of topic-partitions. It will not
>>>> >> change throughout the job, so you'll need to restart the Spark job
>>>> >> if you change Kafka partitions.
>>>> >>
>>>> >> For the Kafka 0.10 / Spark 2.0 integration, if you use Subscribe
>>>> >> or SubscribePattern, it should pick up new partitions as they are
>>>> >> added.
>>>> >>
>>>> >> On Fri, Jul 22, 2016 at 11:29 AM, Srikanth <srikanth.ht@gmail.com>
>>>> >> wrote:
>>>> >> > Hello,
>>>> >> >
>>>> >> > I'd like to understand how Spark Streaming (direct) would handle
>>>> >> > Kafka partition addition.
>>>> >> > Will a running job be aware of new partitions and read from them,
>>>> >> > given that it uses Kafka APIs to query offsets and offsets are
>>>> >> > handled internally?
>>>> >> >
>>>> >> > Srikanth
>>>> >
>>>> >
>>>
>>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

