trafficserver-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Peach <jpe...@apache.org>
Subject Re: Ownership of proxy.process.ssl.total_success_handshake_count
Date Fri, 08 Jul 2016 22:44:47 GMT

> On Jul 8, 2016, at 9:59 AM, Shrihari Kalkar <Shrihari.Kalkar@riverbed.com> wrote:
> 
> 
> On 7/8/16, 9:19 AM, "James Peach" <jpeach@apache.org> wrote:
> 
>> 
>>> On Jul 7, 2016, at 1:16 PM, Shrihari Kalkar
>>> <Shrihari.Kalkar@riverbed.com> wrote:
>>> 
>>> James,
>>> Thank you for clarifications. Please find my responses inline.
>>> 
>>> On 7/7/16, 11:56 AM, "James Peach" <jpeach@apache.org> wrote:
>>> 
>>>> 
>>>>> On Jul 7, 2016, at 9:55 AM, Shrihari Kalkar
>>>>> <Shrihari.Kalkar@riverbed.com> wrote:
>>>>> 
>>>>> Guys,
>>>>> I understand that in code, if a stat is registered under
>>>>> 'proxy.process' it is owned by traffic_server.
>>>> 
>>>> That is by convention. In the code RECT_PROCESS metrics are owned by
>>>> traffic_server. Ownership of metrics is not a particularly strong
>>>> concept
>>>> though :-/
>>> Cool!
>>> 
>>>> 
>>>>> Also, if a stat is registered under stats.config.xml or
>>>>> metrics.config,
>>>>> the 'destination' is owned by traffic_manager.
>>>>> 
>>>>> Since proxy.process.ssl.total_success_handshake_count is listed in
>>>>> stats.config.xml and metrics.config, do you think that this is a
>>>>> conflict of ownership?
>>>> 
>>>> Grovelling through the git history will probably clear this up for you.
>>>> IIRC total_success_handshake_count was moved to the derived metrics for
>>>> compatibility when total_success_handshake_count_in and
>>>> total_success_handshake_count_out were introduced. Maybe it makes sense
>>>> now to make total_success_handshake_count the sum of in and out?
>>> 
>>> Yes, I went through https://issues.apache.org/jira/browse/TS-3409 to
>>> understand why this was added. I agree that it should be a derived
>>> metric
>>> but I found it odd that it was the only derived metric that had a
>>> destination set to 'proxy.process'.
>>>> 
>>>>> I see that on my instance, traffic_manager keeps sending an 'RECG_SET'
>>>>> event for this stat to traffic_server even when nothing is happening.
>>>>> As
>>>>> a result, I am seeing an issue where traffic_server blocks for a very
>>>>> long time for initializing.
>>>> 
>>>> Hmm. I'm not sure that slow initialization would be related to this.
>>>> One
>>>> common cause of initialization lag is setting
>>>> proxy.config.http.wait_for_cache. I think you need to debug this one
>>>> further :)
>>> 
>>> I did some analysis there and here is what I found out. Hopefully I am
>>> not
>>> too off on the wrong path.
>>> 
>>> As part of startup, traffic_server creates two threads (to begin with).
>>> 1. (main) Thread (1) blocks till its signaled by another thread
>> 
>> Where do you see this happening?
> 
> I see this happening in the following code path:
> Main()->initialize_process_manager()->RecProcessInitMessage()
> if (mode_type == RECM_CLIENT) {
> 	send_pull_message(RECG_PULL_REQ);
> 	g_force_req_notify.lock();
> 	g_force_req_notify.wait(); <--- Waits for signal from another thread
> 	g_force_req_notify.unlock();
>  }

Ah. Here's what I see on startup:

[Jul  8 22:25:15.472] Manager {0x7f8d77f4c7c0} NOTE: [LocalManager::startProxy] Launching
ts process
traffic_server: using root directory '/n/trafficserver'
[Jul  8 22:25:15.482] Manager {0x7f8d77f4c7c0} NOTE: [LocalManager::pollMgmtProcessServer]
New process connecting fd '13'
[Jul  8 22:25:15.482] Manager {0x7f8d77f4c7c0} NOTE: [Alarms::signalAlarm] Server Process
born
[Jul  8 22:25:16.483] Manager {0x7f8d77f4c7c0} NOTE: (rec) [recv] RECG_PULL_REQ [24 bytes]
[Jul  8 22:25:16.485] Manager {0x7f8d77f4c7c0} NOTE: (rec) [send] RECG_PULL_ACK [259040 bytes]

I'm a little surprised that the pulled records add up to that much, but it doesn't take a
long time on this particular VM (8G 2 cores).

The RECG_PULL_ACK message sends records where
	- manager is the owner (config, node, cluster, etc) OR
	- it is an unregistered metric
	- it is a non-persistent metric

In the first case, the in-memory configuration state can be different from the on-disk state,
so it probably makes sense to want to know about the current config records as early as possible.

The second case might be an optimization since I guess you don't need to know about unregistered
metrics (yet).

The third case seems pretty wonky to me. traffic_server will only load the persistent metrics
in RECM_STAND_ALONE mode, so there's no guarantee that it already knows the persistent metrics.
I don't see an obvious reason why it would need to not get them at this time. It is good to
have updated values for all the metrics when we start so that we can increment correctly,
etc.

>>> 1. Thread 2 polls for messages from traffic_manager
>>> 
>>> traffic_server is waiting for a message from traffic_manager which
>>> contains all the configuration required for it to go ahead with
>>> initialization. Hence, I understand that it is critical that the main
>>> Thread (1) wait till it gets the configuration.
>>> 
>>> Thread 2 which polls for message from traffic_manager works like this:
>>> for(;;) {
>>> if (pmgmt->require_lm) {     <--- Always True (for us)
>>>   pmgmt->pollLMConnection(); 	<--- |for (count = 0; count < 10000;
>>> count
>>> ++) 
>>> 				<--- |  num = mgmt_read_timeout(...) <---- Blocking call. returns 0
>>> if
>>> nothing was received for 1 second
>>> 				<--- |  if !num: break <--- Break out of the loop and return from
>>> function 
>>> 				<--- |  else: read(fd), add_to_event_queue, continue the loop <---
>>> Back to fetching another message
>>> }
>>> pmgmt->processEventQueue();  <--  Process the messages received in
>>> pollLMConnection()
>>> pmgmt->processSignalQueue();
>>> mgmt_sleep_sec(pmgmt->timeout);
>>> }
>>> 
>>> I found two possible problems here:
>>> 
>>> 1. When I looked into the above code, I observed that the
>>> pollLMConnection
>>> might not return back for a very long time if it keeps getting messages.
>>> As a result, we may not call processEventQueue() (till we receive 10k
>>> messages) which processes the received messages. And unless we process
>>> the
>>> messages, we cannot signal the main Thread (1) to continue, which is
>>> still
>>> blocked. Hence I saw the issue where traffic_server won't complete
>>> initialization for a very long time.
>> 
>> Yes, this could definitely be improved. I'd be happy to help shepherd a
>> patch if this is something you would like to work on.
> 
> Sure, I have a patch in mind that I am testing. Let me do some more
> testing on it and then I will reach out to you to understand the patch
> process.

Great! There's some info in the wiki <https://cwiki.apache.org/confluence/display/TS/Development+Process>.
These days we generally do things with github pull requests. Feel free to ask questions on
IRC in #traffic-server too.

cheers,
James



Mime
View raw message