spark-user mailing list archives

From Erwan ALLAIN <eallain.po...@gmail.com>
Subject Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark
Date Tue, 19 Apr 2016 15:21:24 GMT
I'm describing a disaster recovery but it can be used to make one
datacenter offline for upgrade for instance.

From my point of view, when DC2 crashes:

*On Kafka side:*
- the Kafka cluster will lose one or more brokers (partition leaders and
replicas)
- lost partition leaders will be re-elected among the brokers in the
remaining healthy DC

=> as long as the number of in-sync replicas stays above the minimum
threshold, Kafka should remain operational

*On downstream datastore side (say Cassandra, for instance):*
- deployed across the 2 DCs, with QUORUM reads and QUORUM writes
- idempotent writes

=> it should be OK (depending on the replication factor)
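
A minimal sketch of such an idempotent QUORUM write with the DataStax Java
driver (contact point, keyspace, and schema are hypothetical; the keyspace is
assumed to use NetworkTopologyStrategy with replicas in both DCs so that
QUORUM spans them):

    import com.datastax.driver.core.{Cluster, ConsistencyLevel, QueryOptions}

    val cluster = Cluster.builder()
      .addContactPoint("cassandra-dc1.example.com")
      .withQueryOptions(
        new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM))
      .build()
    val session = cluster.connect("my_keyspace")

    // An INSERT keyed on a natural id is an upsert in Cassandra: replaying
    // the same Kafka record after a restart rewrites the same row, so the
    // write is idempotent.
    val stmt = session.prepare(
      "INSERT INTO events (event_id, payload) VALUES (?, ?)")
    def write(eventId: String, payload: String): Unit =
      session.execute(stmt.bind(eventId, payload))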

*On Spark side:*
- processing should be idempotent, which allows us to restart from the last
committed offset
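
For instance, with the direct stream, committing our own offsets only after a
successful write (a minimal sketch against the 0.8 integration;
writePartitionIdempotently and saveOffsets are hypothetical helpers):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("dc-failover"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my_topic"))

    // Hypothetical helpers: an idempotent upsert keyed on the record key,
    // and an offset store (Cassandra, ZooKeeper, ...) read back at restart.
    def writePartitionIdempotently(iter: Iterator[(String, String)]): Unit =
      iter.foreach { case (k, v) => () /* upsert keyed on k; replays are harmless */ }
    def saveOffsets(ranges: Array[OffsetRange]): Unit =
      ranges.foreach(r => println(s"${r.topic}-${r.partition} -> ${r.untilOffset}"))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition(writePartitionIdempotently)
      // Commit only after the write succeeds; a crash in between just
      // replays the batch into the idempotent sink.
      saveOffsets(offsetRanges)
    }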

I understand that starting up a post-crash job would work.

The question is: how can we detect that DC2 has crashed, so that we can start a new job?

Dynamic topic-partition discovery (at each KafkaRDD creation, for instance)
plus topic subscription may be the answer?
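
Something like this, assuming the Kafka 0.10 consumer integration from
SPARK-12177 (mentioned below) lands: subscribing by pattern would let the
consumer pick up new partitions and topics without a restart. A sketch only;
broker, group, and topic names are hypothetical:

    import java.util.regex.Pattern
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("dynamic-partitions"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my_group",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // SubscribePattern re-evaluates the topic-partition assignment as the
    // cluster changes, instead of fixing the partition set at startup.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](
        Pattern.compile("my_topic.*"), kafkaParams))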

I appreciate your effort.

On Tue, Apr 19, 2016 at 4:25 PM, Jason Nerothin <jasonnerothin@gmail.com>
wrote:

> Is the main concern uptime or disaster recovery?
>
> On Apr 19, 2016, at 9:12 AM, Cody Koeninger <cody@koeninger.org> wrote:
>
> I think the bigger question is what happens to Kafka and your downstream
> data store when DC2 crashes.
>
> From a Spark point of view, starting up a post-crash job in a new data
> center isn't really different from starting up a post-crash job in the
> original data center.
>
> On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <eallain.poctu@gmail.com>
> wrote:
>
>> Thanks Jason and Cody. I'll try to explain the multi-DC case a bit better.
>>
>> As I mentioned before, I'm planning to use one Kafka cluster and 2 or
>> more distinct Spark clusters.
>>
>> Let's say we have the following DC configuration in the nominal case.
>> Kafka partitions are consumed uniformly by the 2 datacenters.
>>
>> DataCenter | Spark      | Kafka Consumer Group | Kafka partitions (P1 to P4)
>> DC 1       | Master 1.1 |                      |
>>            | Worker 1.1 | my_group             | P1
>>            | Worker 1.2 | my_group             | P2
>> DC 2       | Master 2.1 |                      |
>>            | Worker 2.1 | my_group             | P3
>>            | Worker 2.2 | my_group             | P4
>> In case of a DC crash, I would like the partitions to be rebalanced onto
>> the healthy DC, something as follows (asterisks mark the partitions taken
>> over):
>>
>> DataCenter | Spark      | Kafka Consumer Group | Kafka partitions (P1 to P4)
>> DC 1       | Master 1.1 |                      |
>>            | Worker 1.1 | my_group             | P1*, P3*
>>            | Worker 1.2 | my_group             | P2*, P4*
>> DC 2       | Master 2.1 |                      |
>>            | Worker 2.1 | my_group             | P3
>>            | Worker 2.2 | my_group             | P4
>>
>> I would like to know if it's possible:
>> - using a consumer group? (see the sketch below)
>> - using the direct approach? I prefer this one as I don't want to activate
>> the WAL.
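>>
>> For the consumer-group option, what I have in mind is something like the
>> receiver-based stream, which relies on Kafka's high-level consumer
>> rebalancing (a sketch, assuming an existing StreamingContext ssc; hosts
>> and topic are hypothetical):
>>
>>     import org.apache.spark.streaming.kafka.KafkaUtils
>>
>>     // One receiver per DC joining the same group; the high-level consumer
>>     // rebalances partitions among whichever receivers are alive.
>>     val stream = KafkaUtils.createStream(
>>       ssc,
>>       "zk-dc1:2181,zk-dc2:2181", // ZooKeeper quorum (hypothetical)
>>       "my_group",
>>       Map("my_topic" -> 1))      // topic -> number of consumer threads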
>>
>> Hope the explanation is better!
>>
>>
>> On Mon, Apr 18, 2016 at 6:50 PM, Cody Koeninger <cody@koeninger.org>
>> wrote:
>>
>>> The current direct stream only handles exactly the partitions
>>> specified at startup.  You'd have to restart the job if you changed
>>> partitions.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
>>> towards using the kafka 0.10 consumer, which would allow for dynamic
>>> topic partitions.
>>>
>>> Regarding your multi-DC questions, I'm not really clear on what you're
>>> saying.
>>>
>>> On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.poctu@gmail.com>
>>> wrote:
>>> > Hello,
>>> >
>>> > I'm currently designing a solution where 2 distinct Spark clusters (in 2
>>> > datacenters) share the same Kafka cluster (Kafka rack-aware or manual
>>> > broker repartition).
>>> > The aims are:
>>> > - surviving a DC crash: using Kafka resiliency and the consumer group
>>> > mechanism (or something else?)
>>> > - keeping offsets consistent among replicas (vs MirrorMaker, which does
>>> > not preserve offsets)
>>> >
>>> > I have several questions:
>>> >
>>> > 1) Dynamic repartition (one or 2 DCs)
>>> >
>>> > I'm using KafkaDirectStream, which maps one Kafka partition to one Spark
>>> > partition. Is it possible to handle new or removed partitions?
>>> > In the compute method, it looks like we always use the currentOffsets
>>> > map to query the next batch, so it's always the same number of
>>> > partitions? Can we request metadata at each batch?
>>> >
>>> > 2) Multi-DC Spark
>>> >
>>> > Using the direct approach, a way to achieve this would be:
>>> > - to "assign" (Kafka 0.9 term) all topics to the 2 Spark clusters
>>> > - only one of them reads the partitions (checking at some interval, with
>>> > a "lock" stored in Cassandra, for instance)
>>> >
>>> > => not sure if it works, just an idea (see the sketch below)
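>>> >
>>> > A sketch of that lock with Cassandra lightweight transactions (table,
>>> > lease name, and TTL are hypothetical): the active DC holds a lease row
>>> > with a TTL, so the lock frees itself if that DC dies:
>>> >
>>> >     import com.datastax.driver.core.Session
>>> >
>>> >     // INSERT ... IF NOT EXISTS is linearizable; the TTL releases the
>>> >     // lease automatically if the holding DC goes down. Renewal would
>>> >     // be a conditional UPDATE ... USING TTL ... IF owner = ?.
>>> >     def tryAcquireLease(session: Session, dc: String): Boolean =
>>> >       session.execute(
>>> >         "INSERT INTO locks (name, owner) VALUES ('kafka_reader', ?) " +
>>> >           "IF NOT EXISTS USING TTL 30",
>>> >         dc
>>> >       ).one().getBool("[applied]")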
>>> >
>>> > Using a consumer group:
>>> > - commit offsets manually at the end of the batch
>>> >
>>> > => Does Spark handle partition rebalancing?
>>> >
>>> > I'd appreciate any ideas! Let me know if it's not clear.
>>> >
>>> > Erwan
>>> >
>>> >
>>>
>>
>>
>
>
