qpid-dev mailing list archives

From Kerry Bonin <kerrybo...@gmail.com>
Subject Re: Qpid post-mortem and request for suggestions for (my) next release challenge (10M msgs/sec on Windows)
Date Mon, 17 Jun 2013 18:22:45 GMT
On Mon, Jun 17, 2013 at 7:26 AM, Gordon Sim <gsim@redhat.com> wrote:

> On 06/14/2013 03:58 PM, Kerry Bonin wrote:
>> On existing broker failover - can you point me to where that behavior is
>> documented?  Because neither I nor anyone on the four teams I work with
>> has come across the functionality you describe.  I've never seen a client
>> fail over to another broker, only code to attempt to reconnect.
> It appears the reconnect_urls connection option is not in fact documented.
> Sorry about that. It takes a single url or a Variant::List of urls to try
> when reconnecting.
>> Basic features we need:
>> - externally adjustable retry / timeout on connections - to handle
>> differences between LAN, WAN, and satellite internet.
>> - updating broker list: How do you do this?  Never seen it...
> There are two options. The first is that any url in the AMQP 0-10 format
> can itself contain multiple hosts, e.g. amqp:tcp:host1:port1,host2:port2.
> The second is to use the reconnect_urls option as above.
> (When used in conjunction with the failover exchange there is a helper
> class that will receive updates and apply them:
> http://qpid.apache.org/books/0.20/Programming-In-Apache-Qpid/html/ch02s14.html,
> something similar could be done for some other distribution mechanism).
>> - to prevent network splits, how are recovered brokers monitored?  When a
>> failed broker recovers, do clients switch back?  How often / aggressively
>> checked?
> No, there is no switch-back behaviour in the client. The new HA code
> allows a broker to be classed as in a backup or primary role, and backups
> will reject or kick off any clients, causing them to fail over. Whatever
> cluster management solution is in use would then detect changes to primary
> and use QMF to tell each broker what its role is.

I'd like to suggest that this is a serious deficiency.  It would be nice if
it were possible to have some HA features without having to deploy
clustering.  While the lack of clustering for Windows makes this an obvious
problem for Windows users, I'd certainly argue that *nix users might also
like to have failover and recovery without clustering.  And without
clustering, failover without recovery is of little use as an HA feature
because of the split scenario: two clients are talking through broker A,
broker A fails, and both clients fail over to broker B.  Broker A comes back
online.  Another client joins and connects to broker A.  We now have a split:
the new client cannot see the old clients.

For me, this means we have to leave our layer library in place.
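
To make the split scenario concrete, here is a minimal sketch of the
"switch back" policy I am asking for. It is plain Python with hypothetical
names (pick_broker, PreferredBrokerClient, and the is_up probe are all
illustrative assumptions, not part of any Qpid API): every client walks the
same ordered broker list on each periodic check, so old and new clients
converge on the same broker once A recovers.

```python
from typing import Callable, Optional, Sequence


def pick_broker(brokers: Sequence[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable broker in priority order, or None."""
    for name in brokers:
        if is_up(name):
            return name
    return None


class PreferredBrokerClient:
    """Toy client that converges on the highest-priority live broker."""

    def __init__(self, brokers: Sequence[str], is_up: Callable[[str], bool]):
        self.brokers = list(brokers)
        self.is_up = is_up
        self.current: Optional[str] = None

    def check(self) -> Optional[str]:
        """Run periodically: fail over, or switch back to a recovered broker."""
        best = pick_broker(self.brokers, self.is_up)
        if best != self.current:
            self.current = best  # a real client would reconnect here
        return self.current


# Replay the scenario: A fails, A recovers, a new client joins.
up = {"A": True, "B": True}
old_client = PreferredBrokerClient(["A", "B"], lambda name: up[name])
old_client.check()            # connects to A
up["A"] = False
old_client.check()            # fails over to B
up["A"] = True
new_client = PreferredBrokerClient(["A", "B"], lambda name: up[name])
new_client.check()            # new client connects to A
old_client.check()            # old client switches back to A, so no split
```

How often check() runs is the retry / timeout knob mentioned earlier; the
sketch leaves that to the caller.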

>> - how is the application notified on broker failure, connection failover,
>> recovery?
> It isn't. Any threads using the connection will essentially block until
> either the connection is re-established or the configured limit is
> reached and the client gives up trying.
> Now that I write this, I do recall a conversation on this topic with you
> some time back, with this being an issue for you.

I'd like to suggest that this remains a serious deficiency.  In most
software, if a critical failure occurs down in middleware or its supporting
infrastructure, it would be nice if the middleware library could report
this to the application, so a system administrator could do something about
it.  While it's certainly possible to rely on external monitoring systems to
notify an admin, it's also good practice to have an application display
some sort of error condition.  A broker failure in an ESB SOA application
is a critical failure, and the application needs to inform its user that it
has lost connectivity to the system.

For me, this also means we have to leave our layer library in place.
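
For illustration, the kind of notification I mean could look roughly like
the sketch below. It is plain Python with stand-in names (LinkState,
ConnectionMonitor, and the connect callable are hypothetical, not Qpid
classes): instead of blocking silently, the layer reports each state change
to registered listeners so the application can show an error.

```python
from enum import Enum
from typing import Callable, List, Optional


class LinkState(Enum):
    CONNECTED = "connected"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


class ConnectionMonitor:
    """Retries a connection and reports every state change to listeners."""

    def __init__(self, connect: Callable[[], bool], retry_limit: int = 3):
        self._connect = connect          # returns True when a connection succeeds
        self._retry_limit = retry_limit
        self._listeners: List[Callable[[LinkState], None]] = []
        self.state: Optional[LinkState] = None

    def add_listener(self, callback: Callable[[LinkState], None]) -> None:
        self._listeners.append(callback)

    def _set_state(self, state: LinkState) -> None:
        if state is not self.state:      # notify only on actual changes
            self.state = state
            for callback in self._listeners:
                callback(state)

    def on_connection_lost(self) -> bool:
        """Try to reconnect up to retry_limit times, notifying as we go."""
        self._set_state(LinkState.RECONNECTING)
        for _ in range(self._retry_limit):
            if self._connect():
                self._set_state(LinkState.CONNECTED)
                return True
        self._set_state(LinkState.FAILED)
        return False


# Example: the first attempt fails, the second succeeds.
events: List[LinkState] = []
attempts = iter([False, True])
monitor = ConnectionMonitor(lambda: next(attempts), retry_limit=3)
monitor.add_listener(events.append)
monitor.on_connection_lost()
```

A real layer would add delays between attempts; the point here is only that
the application hears about RECONNECTING and FAILED rather than hanging.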

>> Finally, we were ending up with LOTS of application complexity in SOA code
>> when broker failure / recovery meant connection, sender and receiver
>> objects had to be recreated.  This was compounded by Connection being a
>> different type of boost object than senders and receivers.
> That's strange. Those classes all use the Handle template so I can't see
> how they would be different in that regard. I don't suppose you recall the
> details?

If I remember the root issue correctly: if a broker fails, the Connection
object dies, and so do all the associated Sender and Receiver objects.  If I
fail over to another broker, I obtain a new Connection object, and must then
obtain new Sender and Receiver objects from this new Connection object.
This results in a decent bit of complexity propagating through our code,
which was another reason why we created a wrapper over it - so we could
call one send API (that stores a map of queue names to Sender objects) and
register listener callbacks that are kept in a list associated with each
queue name's Receiver object.  On failover/recovery we recreate the Senders
and Receivers once in one place, and update the maps.  If the applications
care about connection state, they ask for our status callbacks, as per
comment above.
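
A rough sketch of the shape of that wrapper, in plain Python with stand-in
classes (FakeConnection and MessagingLayer are hypothetical names for
illustration; this is not our actual library and not the Qpid API). Callers
only ever name a queue; the layer owns the sender map and the callback
lists, and rebuilds the senders in one place on failover. Receivers would be
recreated the same way and are omitted here for brevity.

```python
from typing import Callable, Dict, List, Tuple


class FakeConnection:
    """Stand-in for a broker connection; records what was sent through it."""

    def __init__(self, broker: str):
        self.broker = broker
        self.sent: List[Tuple[str, str]] = []   # (queue, message) pairs

    def create_sender(self, queue: str) -> Callable[[str], None]:
        return lambda message: self.sent.append((queue, message))


class MessagingLayer:
    """Hides connection replacement behind queue-name based calls."""

    def __init__(self, connect: Callable[[str], FakeConnection], broker: str):
        self._connect = connect
        self._senders: Dict[str, Callable[[str], None]] = {}
        self._callbacks: Dict[str, List[Callable[[str], None]]] = {}
        self.connection = connect(broker)

    def send(self, queue: str, message: str) -> None:
        if queue not in self._senders:
            self._senders[queue] = self.connection.create_sender(queue)
        self._senders[queue](message)

    def listen(self, queue: str, callback: Callable[[str], None]) -> None:
        self._callbacks.setdefault(queue, []).append(callback)

    def deliver(self, queue: str, message: str) -> None:
        """Called by the receive loop; fans out to registered callbacks."""
        for callback in self._callbacks.get(queue, []):
            callback(message)

    def fail_over(self, broker: str) -> None:
        """Replace the connection and rebuild every sender in one place."""
        self.connection = self._connect(broker)
        self._senders = {q: self.connection.create_sender(q) for q in self._senders}


# Example: application code is unchanged across the failover.
layer = MessagingLayer(FakeConnection, "A")
layer.send("orders", "m1")
first_connection = layer.connection
layer.fail_over("B")
layer.send("orders", "m2")
```

Registered callbacks survive fail_over() untouched because they are keyed by
queue name rather than by any object tied to the dead connection.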

>> And anything you can think of for dynamically load balancing across
>> brokers?
> Honestly, I think the simplest solution overall is for us to get
> federation working on Windows. I assume it's some issue in the IO layer.
> Does anyone have a concrete understanding of what the problem is and what
> is required to fix it?
> Any volunteers from our Windows experts to take a look (Cliff, Chuck,
> Andrew, Steve)?
>> Greatly appreciate the feedback and input...
> Likewise!
