trafficserver-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Peach <jpe...@apache.org>
Subject Re: Ownership of proxy.process.ssl.total_success_handshake_count
Date Fri, 08 Jul 2016 16:19:21 GMT

> On Jul 7, 2016, at 1:16 PM, Shrihari Kalkar <Shrihari.Kalkar@riverbed.com> wrote:
> 
> James,
> Thank you for clarifications. Please find my responses inline.
> 
> On 7/7/16, 11:56 AM, "James Peach" <jpeach@apache.org> wrote:
> 
>> 
>>> On Jul 7, 2016, at 9:55 AM, Shrihari Kalkar
>>> <Shrihari.Kalkar@riverbed.com> wrote:
>>> 
>>> Guys,
>>> I understand that in code, if a stat is registered under
>>> 'proxy.process' it is owned by traffic_server.
>> 
>> That is by convention. In the code RECT_PROCESS metrics are owned by
>> traffic_server. Ownership of metrics is not a particularly strong concept
>> though :-/
> Cool!
> 
>> 
>>> Also, if a stat is registered under stats.config.xml or metrics.config,
>>> the 'destination' is owned by traffic_manager.
>>> 
>>> Since proxy.process.ssl.total_success_handshake_count is listed in
>>> stats.config.xml and metrics.config, do you think that this is a
>>> conflict of ownership?
>> 
>> Grovelling through the git history will probably clear this up for you.
>> IIRC total_success_handshake_count was moved to the derived metrics for
>> compatibility when total_success_handshake_count_in and
>> total_success_handshake_count_out were introduced. Maybe it makes sense
>> now to make total_success_handshake_count the sum of in and out?
> 
> Yes, I went through https://issues.apache.org/jira/browse/TS-3409 to
> understand why this was added. I agree that it should be a derived metric
> but I found it odd that it was the only derived metric that had a
> destination set to 'proxy.process'.
>> 
>>> I see that on my instance, traffic_manager keeps sending an 'RECG_SET'
>>> event for this stat to traffic_server even when nothing is happening. As
>>> a result, I am seeing an issue where traffic_server blocks for a very
>>> long time for initializing.
>> 
>> Hmm. I'm not sure that slow initialization would be related to this. One
>> common cause of initialization lag is setting
>> proxy.config.http.wait_for_cache. I think you need to debug this one
>> further :)
> 
> I did some analysis there and here is what I found out. Hopefully I am not
> too off on the wrong path.
> 
> As part of startup, traffic_server creates two threads (to begin with).
> 1. (main) Thread (1) blocks till its signaled by another thread

Where do you see this happening?

> 1. Thread 2 polls for messages from traffic_manager
> 
> traffic_server is waiting for a message from traffic_manager which
> contains all the configuration required for it to go ahead with
> initialization. Hence, I understand that it is critical that the main
> Thread (1) wait till it gets the configuration.
> 
> Thread 2 which polls for message from traffic_manager works like this:
> for(;;) {
>  if (pmgmt->require_lm) {     <--- Always True (for us)
>    pmgmt->pollLMConnection(); 	<--- |for (count = 0; count < 10000; count
> ++) 
> 				<--- |  num = mgmt_read_timeout(...) <---- Blocking call. returns 0 if
> nothing was received for 1 second
> 				<--- |  if !num: break <--- Break out of the loop and return from
> function 
> 				<--- |  else: read(fd), add_to_event_queue, continue the loop <---
> Back to fetching another message
>  }
>  pmgmt->processEventQueue();  <--  Process the messages received in
> pollLMConnection()
>  pmgmt->processSignalQueue();
>  mgmt_sleep_sec(pmgmt->timeout);
> }
> 
> I found two possible problems here:
> 
> 1. When I looked into the above code, I observed that the pollLMConnection
> might not return back for a very long time if it keeps getting messages.
> As a result, we may not call processEventQueue() (till we receive 10k
> messages) which processes the received messages. And unless we process the
> messages, we cannot signal the main Thread (1) to continue, which is still
> blocked. Hence I saw the issue where traffic_server won't complete
> initialization for a very long time.

Yes, this could definitely be improved. I'd be happy to help shepherd a patch if this is something
you would like to work on.

> 2. The second problem I wanted to figure out was why is traffic_server
> receiving so many messages at boot-up? I added some debug logs and I found
> that traffic_server kept getting 'RECG' events for the metric
> 'proxy.process.ssl.total_success_handshake_count'. Narrowing it down
> further (in int RecSetRecord(...))
> 	From traffic_manager's perspective:
> 	   if (i_am_not_owner_of(stat)) : send traffic_server a notify message.
> Ideally, this code should not be triggered because, traffic_manager
> controls updates to 'proxy.process.ssl.total_success_handshake_count'.
> However, the ownership in the code is determined solely based on the
> 'string name'. Since the stat name contains 'process', it assumes that it
> is owned by traffic_server. This leads to an interesting scenario where
> traffic_manger keeps updating its own stat and sends unnecessary events to
> traffic_server. I am not still not sure how and when these updates happen.

In Evaluator::eval() we evaluate all the Lua derived metrics and update them using RecSetRecord().
RecSetRecord() will end up sending a RECG_SET message for this metric since it is not the
owner. The traffic_manager event loop evaluates the metrics approximately every second (which
is overkill since the per-thread metric caches are not sampled that quickly).

> But if the updates happen frequently, due to 'Problem 1' we can prevent
> traffic_server from processing any messages for up to 10,000 seconds! (I
> am imagining the case where the message is received just before the
> timeout of 1 second happens)
> 
> 
> I see this situation very often on a Virtual machine for some reason. Most
> of the times, traffic_cop comes and kills traffic_server because it took
> too long for it to initialize.
> 
> I verified that by removing this derived metric, the problem went away.
> However, I have a feeling that we need to fix the 'problem 1' as well.
> 
> 
> 
>> 
>>> I wanted to know if my understanding of ownership is correct. And if it
>>> is, can I open a new bug to track the definition of this stat?
>> 
>> Yes, it sounds like it is worth revisiting SSL stats to make sure they
>> still make sense. Last time I looked at these, there were a few problems.
> 
> Sure, let me open a bug for this. Even if it ends up being invalid, the
> discussion can be helpful.
>> 
>> J
>> 
>> 
> Thanks,
> Shrihari
> 


Mime
View raw message