stratos-dev mailing list archives

From Chamila De Alwis <>
Subject Re: [Discuss] Python Agent Improvments
Date Sun, 13 Dec 2015 10:48:15 GMT
Hi Akila,

I agree that's an exciting road to take, but wouldn't we come across the
same issues, design related and trivial ones like choosing a proper
library, if we rewrite this in Go? Since most of the existing agent is
working and proven code, I think at this point crossing over to Go wouldn't
solve our problems.

Furthermore, Go is not a ubiquitous language that ships by default on most
platforms. AFAIR, Python was chosen as the implementation language for the
new agent mainly because it is present by default on most platforms.

Chamila de Alwis
Committer and PMC Member - Apache Stratos
Software Engineer | WSO2 | +94772207163

On Sat, Dec 12, 2015 at 10:17 AM, Akila Ravihansa Perera <
> wrote:

> Hi Chamila,
> This might sound a bit dramatic... but what if we write the Cartridge Agent
> in Go? Given the shortcomings of Python that we have run into, I'd say it's
> worth a shot.
> Thanks.
> On Sat, Dec 12, 2015 at 1:34 AM, Chamila De Alwis <>
> wrote:
>> Hi devs,
>> During the testing of the Python Cartridge Agent in the past few weeks we
>> were able to identify a few points where performance and functionality
>> could be improved.
>> *1. Agent's thread utilization*
>> It was observed that when the agent is kept running for long periods of
>> time, the number of threads it spawns keeps growing until it reaches the
>> thread limit, giving the following error:
>> Exception in thread Thread-85:
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/", line 810, in __bootstrap_inner
>>   File
>> "/mnt/apache-stratos-python-cartridge-agent-4.1.4/modules/util/",
>> line 65, in run
>>     task_thread.start()
>>   File "/usr/lib/python2.7/", line 745, in start
>>     _start_new_thread(self.__bootstrap, ())
>> error: can't start new thread
>> Although I couldn't pinpoint the exact use case which results in this
>> sudden spike in the thread count, it's most likely caused by the
>> MessageBrokerHeartBeatChecker class inside file. This is a
>> periodic job which checks the liveness of the connected message
>> broker.
>> We can get rid of this job by implementing the Paho MQTT client's
>> on_disconnect callback [1].
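
A minimal sketch of the callback approach suggested above, assuming the paho-mqtt 1.x callback signatures `on_connect(client, userdata, flags, rc)` and `on_disconnect(client, userdata, rc)` (newer releases of the library use different signatures). `BrokerConnectionMonitor` and its `on_broker_lost` hook are hypothetical names, not part of the PCA, and `StubClient` stands in for the real client so that the block runs without a broker:

```python
import threading


class BrokerConnectionMonitor(object):
    """Reacts to broker disconnects instead of polling for liveness."""

    def __init__(self, on_broker_lost):
        # on_broker_lost: hypothetical hook that the agent would supply.
        self.on_broker_lost = on_broker_lost
        self.connected = threading.Event()

    def on_connect(self, client, userdata, flags, rc):
        # paho-mqtt 1.x reports rc == 0 for a successful connection.
        if rc == 0:
            self.connected.set()

    def on_disconnect(self, client, userdata, rc):
        self.connected.clear()
        # rc == 0 means the disconnect was requested by the client itself.
        if rc != 0:
            self.on_broker_lost(rc)


class StubClient(object):
    """Stand-in for paho.mqtt.client.Client, for illustration only."""
    on_connect = None
    on_disconnect = None


lost = []
monitor = BrokerConnectionMonitor(on_broker_lost=lost.append)

client = StubClient()
client.on_connect = monitor.on_connect
client.on_disconnect = monitor.on_disconnect

# Simulate what the library's network loop would do.
client.on_connect(client, None, {}, 0)
client.on_disconnect(client, None, 1)
```

With this in place no periodic heartbeat thread is needed; the library's own network loop invokes the callback when the connection drops.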
>> Furthermore, the ScheduledExecutor class in the
>> file spawns a new thread for each invocation:
>> while not self.terminated:
>>     time.sleep(self.delay)
>>     task_thread = Thread(target=self.task.execute_task)
>>     task_thread.start()
>> This is not good practice. The task should rather be submitted to a thread
>> pool. Imesh has example code showing thread pool usage with the
>> pythonfutures library [2] for Python 2.7.
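
A sketch of how the loop above might look with a bounded pool, using `concurrent.futures` (standard library in Python 3, and the API that pythonfutures backports to 2.7). `PooledScheduledExecutor`, `max_workers` and `CountingTask` are illustrative names and not the existing PCA code; only `execute_task`, `delay` and `terminated` come from the snippet above:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class PooledScheduledExecutor(threading.Thread):
    """Runs task.execute_task() every `delay` seconds on a bounded pool."""

    def __init__(self, delay, task, max_workers=4):
        super(PooledScheduledExecutor, self).__init__()
        self.daemon = True
        self.delay = delay
        self.task = task
        self._terminated = threading.Event()
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def run(self):
        # wait() doubles as an interruptible sleep; it returns True once
        # terminate() has been called, which ends the loop.
        while not self._terminated.wait(self.delay):
            self._pool.submit(self.task.execute_task)

    def terminate(self):
        self._terminated.set()
        # Let the scheduling loop exit before closing the pool, so that
        # nothing is submitted to a pool that is already shut down.
        if self.is_alive():
            self.join()
        self._pool.shutdown(wait=True)


class CountingTask(object):
    """Toy task that records how often and on which threads it ran."""

    def __init__(self):
        self.runs = 0
        self.thread_names = set()
        self._lock = threading.Lock()

    def execute_task(self):
        with self._lock:
            self.runs += 1
            self.thread_names.add(threading.current_thread().name)


task = CountingTask()
executor = PooledScheduledExecutor(delay=0.01, task=task, max_workers=2)
executor.start()
time.sleep(0.3)
executor.terminate()
```

However many times the task fires, it only ever runs on the pool's `max_workers` threads, so the thread count stays bounded.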
>> *2. Decoupling the messaging implementation*
>> When MQTT was chosen initially as the message broker protocol, the PCA
>> was designed to use only MQTT, through the Paho MQTT library [3]. However,
>> after coming across several issues in MQTT, it was decided to default back
>> to AMQP as the message broker protocol for Stratos components. The PCA
>> was the exception, since at the time there was no Python library that
>> supported AMQP 1.0. However, it seems that the Apache Qpid Proton
>> library [4] provides AMQP 1.0 compliant clients for message broker
>> communication.
>> We can provide a common interface for messaging, with implementations
>> for the MQTT and AMQP (if Apache Qpid Proton proves to be helpful)
>> protocols. This way the choice of protocol can be left to the
>> configuration, similar to what we have in the Java based components
>> such as AS, CC etc.
>> Furthermore, this can greatly help if messaging has to be customized to
>> cater for custom protocols such as Apache Kafka[5]. It will only be another
>> implementation of the common interface.
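
One possible shape for such a common interface, as a sketch only. The class names, the `PROTOCOLS` registry and the `mb.*` configuration keys are all made up for illustration, and the in-memory client merely stands in for real MQTT or AMQP backed implementations (`abc.ABC` needs Python 3.4+):

```python
import abc


class MessagingClient(abc.ABC):
    """Protocol-neutral contract the rest of the agent would code against."""

    @abc.abstractmethod
    def subscribe(self, topic, callback):
        """Register callback(topic, payload) for messages on a topic."""

    @abc.abstractmethod
    def publish(self, topic, payload):
        """Send payload to a topic."""


class InMemoryClient(MessagingClient):
    """Toy implementation standing in for an MQTT or AMQP backed one."""

    def __init__(self, host, port):
        self.host, self.port = host, port
        self._subscribers = {}

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        # Exact topic match only; real brokers also support wildcards.
        for callback in self._subscribers.get(topic, []):
            callback(topic, payload)


# Maps the configured protocol name to an implementation class. Real
# entries (for example "mqtt" and "amqp") would be registered here.
PROTOCOLS = {"inmemory": InMemoryClient}


def create_client(config):
    """Pick the implementation named in the agent configuration."""
    try:
        client_class = PROTOCOLS[config["mb.protocol"]]
    except KeyError:
        raise ValueError("Unsupported protocol: %r" % config.get("mb.protocol"))
    return client_class(config["mb.ip"], int(config["mb.port"]))


received = []
client = create_client(
    {"mb.protocol": "inmemory", "mb.ip": "localhost", "mb.port": "1883"})
client.subscribe("instance/status",
                 lambda topic, payload: received.append((topic, payload)))
client.publish("instance/status", '{"event": "ping"}')
```

Supporting another protocol then amounts to adding one more subclass and one more entry in the registry.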
>> *3. Remove dependency on the Git Binary*
>> Currently the PCA manipulates the Git based artifact repository through
>> the Git binary. This is inefficient for several reasons.
>>    1. Every Git command is executed as a separate process, created using
>>    os.fork(). This duplicates memory and results in needless marshalling
>>    and unmarshalling of input and output among the processes.
>>    2. If several commands are executed upon the Git binary at the same
>>    time (e.g. in a multi-tenant application), it can be a performance
>>    bottleneck.
>>    3. This is not platform independent.
>> Therefore it will greatly help if we can move to a Git library for Python
>> which doesn't depend on the Git binary. Dulwich [6] was considered in the
>> past, but its releases at the time still had features marked TBD. However,
>> recent releases seem to have fixed a lot of reported bugs and added
>> features. There are also plans to relicense it under Apache v2.0, which
>> would help us by making it shippable.
>> *4. Use of the maintenance mode*
>> Sajith recently started a discussion on a patching strategy for the PCA
>> (Thread - [Discuss] Suggesting a patching model for PCA). Patching the
>> PCA involves two different scenarios, offline and online. If instances
>> that are already online need to be patched, neither the current PCA nor
>> the Stratos design allows such a window.
>> Maintenance Mode can be used to signal the Autoscaler that the member has
>> gone into maintenance and that scaling decisions should not be taken on
>> that member. While in maintenance mode, the running PCA is shut down
>> gracefully and the patched PCA comes up. When it publishes the
>> InstanceActivated event, the member will again be involved in the scaling
>> decision process.
>> There are a few places that need further development for this scenario to
>> work.
>>    1. It seems that Stratos would mark a member as Pending Termination
>>    if a maintenance mode event is received.
>>    2. The PCA doesn't support graceful shutdown right now. It doesn't
>>    take any input from outside, and the threads it spawns are not daemon
>>    threads, so they do not terminate when the main thread goes down and
>>    the process has to be killed. It could be designed so that a
>>    periodically checked external file, or OS signals, set flags on the
>>    running process telling it to terminate.
>>    3. The capability of a PCA to start on an already activated instance
>>    isn't verified. Based on the status of the member on which it resides,
>>    it should or should not publish certain events.
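
For point 2 above, the OS signal variant could be sketched as follows. `ShutdownAwareAgent` and its methods are hypothetical names, not existing PCA classes. The idea is a single `threading.Event` that a signal handler sets and that every periodic worker checks, combined with daemon threads so that a stuck worker cannot keep the process alive:

```python
import signal
import threading
import time


class ShutdownAwareAgent(object):
    """Periodic workers that all stop once a shutdown flag is set."""

    def __init__(self):
        self.shutdown_requested = threading.Event()
        self.workers = []

    def install_signal_handlers(self):
        # signal.signal() may only be called from the main thread.
        for signum in (signal.SIGTERM, signal.SIGINT):
            signal.signal(signum, self.request_shutdown)

    def request_shutdown(self, signum=None, frame=None):
        self.shutdown_requested.set()

    def start_periodic(self, interval, job):
        worker = threading.Thread(target=self._loop, args=(interval, job))
        worker.daemon = True  # never blocks interpreter exit
        worker.start()
        self.workers.append(worker)

    def _loop(self, interval, job):
        # wait() returns True as soon as shutdown is requested.
        while not self.shutdown_requested.wait(interval):
            job()

    def run_until_shutdown(self, join_timeout=5.0):
        # Waiting with a timeout keeps the main thread responsive to signals.
        while not self.shutdown_requested.wait(0.5):
            pass
        for worker in self.workers:
            worker.join(join_timeout)


ticks = []
agent = ShutdownAwareAgent()
if threading.current_thread() is threading.main_thread():
    agent.install_signal_handlers()
agent.start_periodic(0.01, lambda: ticks.append(time.time()))
time.sleep(0.1)
agent.request_shutdown()  # what a SIGTERM would trigger via the handler
agent.run_until_shutdown()
```

A periodically checked external file could set the same flag by calling `request_shutdown()` from one of the periodic jobs.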
>> *5. Update topology model based on the message broker events*
>> Currently the topology is repopulated every time the complete topology
>> event is received. This was done as a quick fix to update the topology
>> model. However for large complete topology events, building event objects
>> and updating contexts can be costly, and on the design front, that is not
>> the intended use of the complete topology event. The events should
>> dynamically update the topology status of the agent. For this to happen
>> another task should first be completed.
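
To illustrate the difference, here is a rough sketch in which the complete topology event only seeds the model and later events change it incrementally. The event type names, the dictionary layout and the `memberId` field are assumptions made for this example and are not taken from the PCA or the messaging component:

```python
class TopologyModel(object):
    """Keeps member state up to date from individual events."""

    def __init__(self):
        self.members = {}  # member id -> dict of member attributes
        self.initialized = False
        self._handlers = {
            "CompleteTopologyEvent": self.on_complete_topology,
            "MemberActivatedEvent": self.on_member_activated,
            "MemberTerminatedEvent": self.on_member_terminated,
        }

    def on_complete_topology(self, event):
        # Used once to seed the model; later snapshots are ignored.
        if self.initialized:
            return
        self.members = dict((m["memberId"], dict(m)) for m in event["members"])
        self.initialized = True

    def on_member_activated(self, event):
        member_id = event["memberId"]
        member = self.members.setdefault(member_id, {"memberId": member_id})
        member["status"] = "Activated"

    def on_member_terminated(self, event):
        self.members.pop(event["memberId"], None)

    def dispatch(self, event_type, event):
        """Route an event to its handler; returns False if unknown."""
        handler = self._handlers.get(event_type)
        if handler is None:
            return False
        handler(event)
        return True


topology = TopologyModel()
topology.dispatch("CompleteTopologyEvent", {"members": [
    {"memberId": "m1", "status": "Created"},
    {"memberId": "m2", "status": "Activated"},
]})
topology.dispatch("MemberActivatedEvent", {"memberId": "m1"})
topology.dispatch("CompleteTopologyEvent", {"members": []})  # ignored
topology.dispatch("MemberTerminatedEvent", {"memberId": "m2"})
```

Each event touches only the member it concerns, so the cost no longer depends on the size of the complete topology.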
>> *6. Verify message broker event classes*
>> Currently all the message broker events are deserialized into event
>> objects according to the classes defined in the modules/event package.
>> However, this doesn't depend on or track any changes made to the Stratos
>> Messaging Component, and can therefore quickly become outdated without any
>> hint from the build or the tests. The events have to be verified against
>> their counterparts in the messaging component.
>> *7. Decouple log file publishing protocol*
>> The PCA has a log publishing feature where a specified list of log files
>> is monitored and entries are published to a Thrift data receiver. However,
>> there can be situations where the server logs (e.g. the PCA's own log) have
>> to be monitored from outside (e.g. monitoring the agent.log of a Docker
>> container inside Kubernetes using Fluentd [7], or directly publishing
>> agent.log to ELK). For those situations the log publishing feature is not
>> flexible enough. If we can introduce a pluggable design for the log event
>> publisher, it can solve most of these situations with low effort.
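
A pluggable log event publisher could look roughly like this. `LogEventPublisher`, `InMemoryPublisher` and `LogFileMonitor` are illustrative names rather than existing PCA classes. The monitor only knows how to find new lines, and the publisher it is given decides where they go:

```python
import os
import tempfile


class LogEventPublisher(object):
    """Contract for anything that can ship a log entry somewhere."""

    def publish(self, source, entry):
        raise NotImplementedError


class InMemoryPublisher(LogEventPublisher):
    """Illustrative backend; a Thrift based one would sit beside it."""

    def __init__(self):
        self.entries = []

    def publish(self, source, entry):
        self.entries.append((source, entry))


class LogFileMonitor(object):
    """Forwards complete lines appended to a file since the last poll."""

    def __init__(self, path, publisher):
        self.path = path
        self.publisher = publisher
        self._offset = 0

    def poll(self):
        with open(self.path, "rb") as log_file:
            log_file.seek(self._offset)
            while True:
                line = log_file.readline()
                if not line.endswith(b"\n"):
                    break  # end of file or partial line; retry next poll
                entry = line.rstrip(b"\r\n").decode("utf-8", "replace")
                self.publisher.publish(self.path, entry)
                self._offset = log_file.tell()


handle, path = tempfile.mkstemp(suffix=".log")
os.close(handle)
publisher = InMemoryPublisher()
monitor = LogFileMonitor(path, publisher)

with open(path, "a") as log_file:
    log_file.write("agent started\n")
monitor.poll()
with open(path, "a") as log_file:
    log_file.write("member activated\npartial")
monitor.poll()
os.remove(path)
```

Swapping the destination then means passing a different `LogEventPublisher` to the monitor, without touching the monitoring code.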
>> *8. Verify Python 3 compatibility*
>> Python 2.7 is said to be supported until 2020 [8], so this is not a major
>> concern. However, since the choice between Python 2.7 and Python 3 can be
>> political, it might be good to verify and adjust for Python 3
>> compatibility. The PCA was originally written with Python 2.7 in mind,
>> since it is still the version most widely distributed by default.
>> These issues might not be solved immediately, but addressing them can be
>> critical in smoothing out some rough spots in the PCA implementation.
>> WDYT? Ideas?
>> [1] -
>> [2] -
>> [3] -
>> [4] -
>> [5] -
>> [6] -
>> [7] -
>> [8] -
>> Regards,
>> Chamila de Alwis
>> Committer and PMC Member - Apache Stratos
>> Software Engineer | WSO2 | +94772207163
>> Blog:
> --
> Akila Ravihansa Perera
> WSO2 Inc.;
> Blog:
