stratos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Imesh Gunaratne <>
Subject Re: [Discuss] Python Agent Improvments
Date Mon, 14 Dec 2015 04:52:04 GMT
IMO the threading issue and Git command execution based on a python library
are high priority. We might need to immediately address them. Others can be
considered as low priority.

Writing agent in Go is a good thought but we might need to evaluate the
effort, advantages & disadvantages first.


On Sun, Dec 13, 2015 at 6:55 PM, Gayan Gunarathne <> wrote:

> Hi Chamila,
> On Sat, Dec 12, 2015 at 1:34 AM, Chamila De Alwis <>
> wrote:
>> Hi devs,
>> During the testing of the Python Cartridge Agent in the past few weeks we
>> were able to identify a few points where performance and functionality
>> could be improved.
>> *1 - Agent's thread utilization  *
>> It was observed that when the agent is kept running for long periods of
>> time, the thread count spawned would go up and reach the thread limit
>> giving the following error.
>> Exception in thread Thread-85:
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/", line 810, in __bootstrap_inner
>>   File
>> "/mnt/apache-stratos-python-cartridge-agent-4.1.4/modules/util/",
>> line 65, in run
>>     task_thread.start()
>>   File "/usr/lib/python2.7/", line 745, in start
>>     _start_new_thread(self.__bootstrap, ())
>> error: can't start new thread
>> Although I couldn't pinpoint the exact use case which results in this
>> sudden spike in the thread count, it's most likely caused by the
>> MessageBrokerHeartBeatChecker class inside file. This is a
>> periodic job which checks for the liveliness of the connected message
>> broker.
>> We can get rid of this by implementing a callback method of Paho Mqtt
>> Client's on_disconnect method [1].
>> Furthermore, the ScheduledExecutor class in the
>> file, for each invocation it spawns a new thread.
>> while not self.terminated:
>>     time.sleep(self.delay)
>>     task_thread = Thread(target=self.task.execute_task)
>>     task_thread.start()
>> This is not a good practice. It should rather submit the task to a thread
>> pool. Imesh has an example code of the thread pool usage using the
>> pythonfutures library[2] for Python 2.7.
> +1
>> *2. Decoupling the messaging implementation*
>> When MQTT was chosen initially as the message broker protocol, the PCA
>> was designed to only use MQTT, using the Paho MQTT library[3]. However
>> after coming across several issues in MQTT, it was decided to default back
>> to AMQP as the message broker protocol for Stratos Components. Except for
>> the PCA, since at the time, there was no AMQP 1.0 supported Python library.
>> However, it seems that Apache QPid Proton library[4] provides AMQP 1.0
>> compliant clients for message broker communication.
> It is good if we can introduce the AMQP also as a messaging protocol with
> PCA.  This will provide flexibility for the end use to choose the messaging
> protocol which he prefers. In the case of the Apache QPid we need to
> recheck about the platform independence of that library.
>> We can provide a common interface for messaging and provide
>> implementation for MQTT and AMQP (if Apache QPid Proton proves to be
>> helpful) protocols. This way we can provide the selection of the protocol
>> to the configuration, similar to what we have in the Java based components
>> such as AS, CC etc.
>> Furthermore, this can greatly help if messaging has to be customized to
>> cater for custom protocols such as Apache Kafka[5]. It will only be another
>> implementation of the common interface.
>> *3. Remove dependency on the Git Binary*
>> Currently the PCA manipulates the Git based artifact repository through
>> the Git binary file. This is inefficient for several reasons.
>>    1. Every Git command is executed as a separate process, created using
>>    os.fork(). This duplicates memory and results in needless marshalling
>>    and unmarshalling of input and output among the processes.
>>    2. If several commands are executed upon the Git binary at the same
>>    time (ex: of a multi-tenant application), it can be a performance
>>    bottleneck.
>>    3. This is not platform independent.
>> Therefore it will greatly help if we can go for a Git library for Python
>> which doesn't depend on the Git binary. Dulwich [6] was considered in the
>> past, but then releases had more features TBD. However recentl releases
>> seems  to have fixed a lot of bug reports and features. It also has plans
>> to be re licensed as Apache v2.0 which also would help us by making it
>> shippable.
> +1. May be we can do some research with Dulwich library and introduce this
> to the PCA GIT operations.
>> *4. Use of the maintenance mode*
>> Sajith recently started a discussion on a patching strategy for the PCA
>> (Thread - [Discuss] Suggesting a patching model for PCA). Patching for
>> the PCA involves two different scenarios, offline and online. If instances
>> that are already online needs to be patched the current PCA or Stratos
>> design doesn't allow such a window.
>> Maintenance Mode can be used to signal the Autoscaler that the member has
>> gone to maintenance and scaling decisions should not be taken on that
>> member. While in the maintenance mode, the running PCA is gracefully
>> shutdown and the patched PCA comes up. When it publishes InstanceActivated
>> event, the member will again be involved in the scaling decision process.
>> There are few places that needs further development for this scenario to
>> work.
>>    1. It seems that Stratos would mark a member as Pending Termination
>>    if a maintenance mode event is received.
>>    2. The PCA doesn't support graceful shutdown right now. It doesn't
>>    take any inputs from outside, and the threads spawned by it are not daemon
>>    threads, which results in a process that needs to be killed because the
>>    threads do not terminate when the main thread goes down. It can be designed
>>    that an input to a periodically checked external file or somehow OS Signals
>>    can set flags on the running process to terminate.
>>    3. The capability of a PCA to start on an already activated instance
>>    isn't verified. Based on the status of the member that it resides, it
>>    should/should not publish certain events.
>> *5. Update topology model based on the message broker events*
>> Currently the topology is repopulated everytime the complete topology
>> event is received. This was done as a quick fix to update the topology
>> model. However for large complete topology events, building event objects
>> and updating contexts can be costly, and on the design front, that is not
>> the intended use of the complete topology event. The events should
>> dynamically update the topology status of the agent. For this to happen
>> another task should first be completed.
> I am not clear about this point. Can you explain further? With the
> introduction of the on demand complete topology publish feature feature, we
> don't need to listen for the complete topology event always. Is n't?
>> *6. Verify message broker event classes*
>> Currently all the message broker events are deserialized in to event
>> objects as per to the classes defined in modules/event package. However
>> this doesn't depend or track any changes done to the Stratos Messaging
>> Component, and therefore can quickly be outdated without a hint on the
>> build or the tests. The events have to be verified against their
>> counterparts in the messaging component.
> In the case of the instance life cycle , I think important to check the
> all the events that received in the CA side.
>> *7. Decouple log file publishing protocol*
>> The PCA has a log publishing feature where a specified list of log files
>> will be monitored and entries published to a thrift data receiver. However
>> there can be situations where the server logs (ex: PCA's own log) has to be
>> monitored from outside (ex: monitoring the agent.log of a Docker container
>> inside Kubernetes using FluentD [7], or directly publishing agent.log to
>> ELK). For those situations the log publishing feature is not flexible
>> enough. If we can introduce a pluggable design for the log event publisher,
>> it can solve most of these situations with low effort.
> In the log run , I think we may need to thing about some central logging
> feature support with the PCA.
>> *8. Verify Python 3 compatibility*
>> Python 2.7 is said to be supported until 2020 [8], so this is not a major
>> concern. However since the opinions between Python 2.7 and Python 3 can be
>> political, it might be good to verify and adjust for Python 3
>> compatibility. The PCA was originally written with Python 2.7 in mind,
>> since it is still the most distributed version by default.
>> These changes might not be immediately solved, however they can be
>> critical in tuning some rough spots in the PCA implementation. WDYT? Ideas?
>> [1] -
>> [2] -
>> [3] -
>> [4] -
>> [5] -
>> [6] -
>> [7] -
>> [8] -
>> Regards,
>> Chamila de Alwis
>> Committer and PMC Member - Apache Stratos
>> Software Engineer | WSO2 | +94772207163
>> Blog:
> --
> Gayan Gunarathne
> Technical Lead, WSO2 Inc. (
> Committer & PMC Member, Apache Stratos
> email :  | mobile : +94 775030545 <%2B94%20766819985>

Imesh Gunaratne

Senior Technical Lead, WSO2
Committer & PMC Member, Apache Stratos

View raw message