trafficserver-users mailing list archives

From Jason Giedymin <jason.giedy...@gmail.com>
Subject Re: Look Mom, no locks!
Date Sat, 10 Mar 2012 00:57:35 GMT
Sorry, was using this as an example of an exciting report vs a dull one at work.

-Jason

On Mar 9, 2012, at 7:55 PM, Jason Giedymin <jason.giedymin@gmail.com> wrote:

> Came in last week, check out this awesome bug report. 
> 
> -Jason
> 
> On Mar 2, 2012, at 10:29 PM, "Whitaker, James X. -ND" <James.X.Whitaker.-ND@disney.com>
wrote:
> 
>> I imagine that it is probably humanly impossible to see THIS bug with code inspection.  Or even by stepping through it in a debugger.
>>  
>> I must have scoured this code about a dozen times looking for problems, and I missed it.  I had to utterly pulverize the ATS source code with instrumentation-generating ruby scripts to zero in on it.  What a mess I made.
>>  
>> On a RedHat or CentOS box, go into lib/ts, edit the Makefile changing the gcc optimizer flags to -E, run make in that lib/ts directory, and look at the preprocessor output in the files ink_queue.o and Allocator.o.
>>  
>> You will have to stare, and stare, and stare before you see it.  And you gotta have
the right kind of brain.
>>  
>> ink_freelist_new()
>>  
>> f->head can get grabbed by simultaneous threads/cores that call ink_freelist_new() at the same instant in time -- each thread grabs the same memory location.
>>  
>> The remaining code with all those atomic operators neglects to couple the assignment
of item=f->head to the increments of head within a single lock.
>>  
>> Allocator.o shows that there are no locks.
>>  
>> The more cores used by ATS, the more unstable ATS will be.
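>> (A minimal sketch of the lock-free pop pattern that the preprocessor output later in this mail boils down to -- illustrative modern C++, not ATS code; all names here are mine.  The version tag packed into head's upper bits is what is supposed to defeat the ABA hazard marked in the comments.)

```cpp
#include <atomic>
#include <cassert>

// Treiber-stack style freelist: the shape of the lock-free path, simplified.
struct Node { Node *next = nullptr; };

struct FreeList {
  std::atomic<Node *> head{nullptr};

  void push(Node *n) {
    Node *old = head.load();
    do {
      n->next = old;                        // link in front of the current head
    } while (!head.compare_exchange_weak(old, n));
  }

  Node *pop() {
    Node *old = head.load();                // two threads CAN read the same head here
    while (old != nullptr) {
      Node *next = old->next;               // without a version tag, a concurrent
                                            // pop+push of the same node (ABA) can
                                            // make the CAS below succeed wrongly
      if (head.compare_exchange_weak(old, next))
        return old;                         // only one thread wins the CAS
    }
    return nullptr;                         // empty: real code allocates a chunk
  }
};
```

>> Nothing but that CAS (plus, in the real ink_queue.cc, the 16-bit version counter in head.data) stands between simultaneous callers and handing out the same slot twice.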
>>  
>> ---
>>  
>> I would imagine that the best of the best of the world's developers have been chasing down the truly baffling side effects of this bug for years, and probably just plain giving up.  There are all sorts of odd mitigations in the code.  Lots of mystery crash reports.  I have to wonder if it is all due to this.
>>  
>> ---
>>  
>> Credit for this bug find goes to the Foundational Technologies Team at Disney; to Cameron Stevens, my supervisor, who pushed me (and paid me) and cleared my to-do list to make room to find this thing; of course to moi, who spent the last several weeks wracking my brains, coming up with one (sometimes absurd) hypothesis after another and one debugging and code-scanning technique after another, trying to zero in on it...  and I suppose to Lisa, providing the incentive by stressing the need for financial stability before we get married... a serious incentive to make a name for myself.
>>  
>> ---
>>  
>> I created some ruby scripts to embed instrumentation in the source code (and lots of hand cleanup to make it compile... sigh).
>>  
>> Among other hacks:
>>    each continuation is assigned a unique ID (jimw_stamp) when it is created
>>    each InkContInternal::init call creates another unique ID
>>        (because the freepool allocator does a memcpy without constructor invocation).
>>    each time an event was scheduled, the event inherits the jimw_stamp from the continuation
of "origin"
>>    every schedule_* macro invocation and TSContSchedule and a few others kept track
of __FILE__ and __LINE__ that called it, logging that in new member variables of the event
object.
>>    the configure script was altered by replacing "-O2" with "-O0 -ggdb3" for the
release version
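>> (The stamping scheme above can be sketched like this -- hypothetical stand-ins, not the actual instrumentation:)

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative analog of the jimw_stamp instrumentation.  All names
// (Cont, Event, next_stamp, make_event) are hypothetical stand-ins.
static std::atomic<uint64_t> next_stamp{1};

struct Cont {
  uint64_t jimw_stamp = 0;
  void init() {
    // re-stamp on every init, because the freepool allocator memcpys a
    // prototype object instead of running a constructor
    jimw_stamp = next_stamp.fetch_add(1);
  }
};

struct Event {
  uint64_t jimw_stamp = 0;
  const char *file = nullptr;  // __FILE__ of the schedule_* call site
  int line = 0;                // __LINE__ of the schedule_* call site
};

// Every scheduling path copies the stamp from the continuation of origin,
// so a stamp mismatch at crash time identifies cross-wired continuations.
inline Event make_event(const Cont &origin, const char *file, int line) {
  Event e;
  e.jimw_stamp = origin.jimw_stamp;
  e.file = file;
  e.line = line;
  return e;
}
```

>> If the stamps are truly unique per live continuation, an event and a mismatched receiving continuation should never share one -- which is exactly the impossibility described below.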
>>  
>> When it crashes, the debugger shows me:
>>     the Event::jimw_stamp taken from the continuation whose associated instruction
stream originated the event
>>     the InkContInternal::jimw_stamp of the receiving continuation
>>     The line and file number in the source code that originated the event that triggered
the crash
>>  
>> The core dump showed an event generated for a continuation that had been created by TSContCreate in (let's call it) path A through the code.  The event handler that received the event was the code for handling a completely different continuation, path B.
>>  
>> Both Path A and Path B were based upon continuations that were created with TSContCreate.  Usually, but not always, the handler function whose code triggered the crash would be code for a continuation that was a "helper" continuation intended to support the HttpSM class or other similar classes.  This Path B code was utterly unprepared to handle the events from Path A, would do something freaky, and quite naturally trigger an assert.
>>  
>> It took lots of convoluted setup to allow RHEL to generate core dumps of setuid programs...
>>  
>> The big boogeyman, HOWEVER, is that both the incoming EVENT and the receiving mismatched
CONTINUATION had the same "jimw_stamp" tag in them.
>>  
>> WOAH!
>>  
>> Think about that for a moment.
>>  
>> An event that had been assigned the unique tag from the continuation allocated by TSContCreate in Path A is received by the code for another continuation created by TSContCreate in Path B.  They share the same guaranteed-by-the-constructor-to-be-unique ID tag.
>>  
>> The only possible explanation is that two continuations somehow had the same jimw_stamp!  The same unique tag.  Supposedly impossible.  That means that TWO LIVE CONTINUATIONS had to be occupying the SAME MEMORY SLOT IN THE FREELIST.  One of the continuations is receiving all the events for both of them!
>>  
>> "YOU HAVE GOT TO BE KIDDING ME!"  I said out loud.  Banged my ankle on something
really hard.  Still nursing the wound.
>>  
>> Um... that would be bad.  Real bad.  Unlikely, I thought.  That would mean a bug in a freepool allocator...  most likely a missing mutex....  I could not imagine that kind of thing ever getting through a code review gauntlet.  Hundreds of developers have inspected THAT code, right?
>>  
>> Code inspection shows that there are locks and mutexes and spinlocks, atomic increments, atomic CAS.....  There is NO WAY that could fail, right?  It looks like it should be bulletproof, but look at all those preprocessor directives!
>>  
>> Well, ink_freelist_new is not the easiest code to figure out.  Take a look at it....  mind boggling....  After an hour or more I just plain gave up trying to figure out what it is ACTUALLY DOING.  I went digging through gcc manuals and figured out how to turn on preprocessor output.  If this code, so full of preprocessor directives, and so full of locks, is putting two things into the same memory location: WHAT ON EARTH IS ACTUALLY GETTING COMPILED?  I edited the Makefile and put in -E in place of the optimizer flags, and let the preprocessor do its thing.  What on earth does the actual C++ code that is actually getting compiled look like?
>>  
>> ---
>>  
>> ink_queue.o, preprocessor output
>>  
>>  
>> void *
>> ink_freelist_new(InkFreeList * f)
>> {
>> # 234 "ink_queue.cc"
>>   head_p item;
>>   head_p next;
>>   int result = 0;
>>
>>   do {
>>     *((uint64_t*)&(item)) = *((uint64_t*)&(f->head));
>>     if (((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) == __null) {
>>       uint32_t type_size = f->type_size;
>>       uint32_t i;
>> # 255 "ink_queue.cc"
>>       void *newp = __null;
>>
>>       if (f->alignment)
>>         newp = ink_memalign(f->alignment, f->chunk_size * type_size);
>>       else
>>         newp = ink_malloc(f->chunk_size * type_size);
>>       if (newp)
>>         ink_atomic_increment64(&freelist_allocated_mem, (int64_t) (f->chunk_size * type_size));;
>>
>>       (item).data = ((((intptr_t)(newp))&0x0000FFFFFFFFFFFFULL) | (((0)&0xFFFFULL) << 48));
>> # 291 "ink_queue.cc"
>>       ink_atomic_increment((int *) &f->allocated, f->chunk_size);
>>       ink_atomic_increment64(&fastalloc_mem_total, (int64_t) f->chunk_size * f->type_size);
>>
>>       for (i = 0; i < f->chunk_size; i++) {
>>         char *a = ((char *) ((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))) + i * type_size;
>>
>>         ink_freelist_free(f, a);
>> # 315 "ink_queue.cc"
>>       }
>>
>>       ink_atomic_increment((int *) &f->count, f->chunk_size);
>>       ink_atomic_increment64(&fastalloc_mem_in_use, (int64_t) f->chunk_size * f->type_size);
>>
>>     } else {
>>       (next).data = ((((intptr_t)(*((void **)((char *)((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) + f->offset))))&0x0000FFFFFFFFFFFFULL) | ((((((intptr_t)(item).data)>>48) + 1)&0xFFFFULL) << 48));
>>
>>       result = ink_atomic_cas64((int64_t *) & f->head.data, item.data, next.data);
>> # 345 "ink_queue.cc"
>>     }
>>   }
>>   while (result == 0);
>>   (void)(!((uintptr_t)((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))))&(((uintptr_t)f->alignment)-1)));
>>
>>   ink_atomic_increment((int *) &f->count, 1);
>>   ink_atomic_increment64(&fastalloc_mem_in_use, (int64_t) f->type_size);
>>
>>   return ((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))));
>> }
>>  
>>  
>> Allocator.o section:
>>  
>> template<class C> inline C * ClassAllocator<C>::alloc()
>> {
>>   void *ptr = ink_freelist_new(&this->fl);
>>
>>   if (sizeof(C) < 512) {
>>     for (unsigned int i = 0; i < (((sizeof(C))+15)&~15) / sizeof(int64_t); i++)
>>       ((int64_t *) ptr)[i] = ((int64_t *) &this->proto.typeObject)[i];
>>   } else
>>     memcpy(ptr, &this->proto.typeObject, sizeof(C));
>>   return (C *) ptr;
>> }
>>  
>> InkAPI.cc:
>>  
>> static ClassAllocator<INKContInternal> INKContAllocator("INKContAllocator");
>>  
>> later in InkAPI.cc
>>  
>> ////////////////////////////////////////////////////////////////////
>> //
>> // Continuations
>> //
>> ////////////////////////////////////////////////////////////////////
>>  
>> TSCont
>> TSContCreate(TSEventFunc funcp, TSMutex mutexp)
>> {
>>   // mutexp can be NULL
>>   if (mutexp != NULL)
>>     sdk_assert(sdk_sanity_check_mutex(mutexp) == TS_SUCCESS);
>>  
>>   INKContInternal *i = INKContAllocator.alloc();
>>  
>>   i->init(funcp, mutexp);
>> // printf ("CREATING CONT: %p  marker: %d\n", i, i->jimw_stamp);
>>   return (TSCont)i;
>> }
>>  
>>  
>> LOOK MOM!  NO LOCKS!!!!!
>>  
>> ---------------------------------
>>  
>> Now, my guess is that for who knows how many years, anyone reviewing the code saw
this in ink_queue.h and assumed the locks were in place.
>>  
>> INK_USE_MUTEX_FOR_FREELISTS is not actually defined; it is only defined when IS_TS_MICRO_BUILD (??!?) is 1, which it is NOT.
>>  
>> If you do a "grep", it shows the lines in ink_queue.h where INK_USE_MUTEX_FOR_FREELISTS would take effect:
>>  
>> 161 #if !defined(INK_USE_MUTEX_FOR_FREELISTS)
>> 162   inkcoreapi void *ink_freelist_new(InkFreeList * f);
>> 163   inkcoreapi void ink_freelist_free(InkFreeList * f, void *item);
>> 164 #else                           /* INK_USE_MUTEX_FOR_FREELISTS */
>> 165   inkcoreapi void *ink_freelist_new_wrap(InkFreeList * f);
>> 166   static inline void *ink_freelist_new(InkFreeList * f)
>> 167   {
>> 168     void *retval = NULL;
>> 169
>> 170     ink_mutex_acquire(&(f->inkfreelist_mutex));
>> 171     retval = ink_freelist_new_wrap(f);
>> 172     ink_mutex_release(&(f->inkfreelist_mutex));
>> 173     return retval;
>> 174   }
>> 175
>> 176   inkcoreapi void ink_freelist_free_wrap(InkFreeList * f, void *item);
>> 177   static inline void ink_freelist_free(InkFreeList * f, void *item)
>> 178   {
>> 179     ink_mutex_acquire(&(f->inkfreelist_mutex));
>> 180     ink_freelist_free_wrap(f, item);
>> 181     ink_mutex_release(&(f->inkfreelist_mutex));
>> 182   }
>> 183 #endif /* INK_USE_MUTEX_FOR_FREELISTS */
>>  
>> If you turn it on by defining INK_USE_MUTEX_FOR_FREELISTS at the top of ink_queue.h, the code does NOT compile.  This suggests a VERY old bug.  Once you get it to compile by adding the missing header file, it deadlocks on itself in line 306 of ink_queue.cc, where it calls ink_freelist_free instead of ink_freelist_free_wrap....  another sign of a very old bug....  Well...  Fix that...  fire it up...
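>> (For reference, the wrapper pattern with the deadlock fixed -- the chunk-initialization path must call the unlocked *_wrap variant, because the non-recursive mutex is already held -- looks roughly like this; simplified stand-in types, not the actual ATS declarations:)

```cpp
#include <cassert>
#include <mutex>

// Simplified stand-in for the real InkFreeList; "scratch" fakes the list state.
struct InkFreeList {
  std::mutex inkfreelist_mutex;
  void *scratch = nullptr;
};

void *ink_freelist_new_wrap(InkFreeList *f) {
  // real code pops from the list, allocating a fresh chunk if empty; on that
  // allocation path each new item is freed back via the *_wrap form, NOT the
  // locking form, since the mutex is already held and is not recursive
  return f->scratch ? f->scratch : static_cast<void *>(new char[64]);
}

void ink_freelist_free_wrap(InkFreeList *f, void *item) {
  f->scratch = item;  // stand-in for pushing onto the list
}

void *ink_freelist_new(InkFreeList *f) {
  std::lock_guard<std::mutex> guard(f->inkfreelist_mutex);
  return ink_freelist_new_wrap(f);
}

void ink_freelist_free(InkFreeList *f, void *item) {
  std::lock_guard<std::mutex> guard(f->inkfreelist_mutex);
  ink_freelist_free_wrap(f, item);
}
```

>> Calling the locking ink_freelist_free from inside ink_freelist_new_wrap is exactly the self-deadlock at line 306.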
>>  
>> Stable...  REAL STABLE.  It has been running for over twenty hours now, hammered by my ruby script load generator, with no trace of errors, where it used to barf and crash in seconds or minutes, or occasionally an hour.
>>  
>> But I can not get it to show over 56% of CPU with "top" (15-20% max on any given core), not even with 100 load generators pounding on ATS.  Before, it didn't take much of anything to make it max out all four virtual cores.  This is, of course, on my Mac VMware Fusion with the version from 2009...  the inexplicable delays could be some kind of queuing glitch from what I suspect is the probable absence of a monotonically increasing, possibly double-buffered clock on my VM/Linux kernel.  So I tried it with the 3.0.3 version with Brian's queuing fix, and it does the same thing, with query times starting at around 30 ms instead of 3 ms...
>>  
>> Turn on the locks, and I get a 5 to 10X performance hit!
>>  
>> I am not sure yet what is up with the degraded performance turning on these mutex
locks...  I would expect a hit because of locking delays... but this is REALLY severe... This
is why MALLOC in a multi-core box is a challenge, because thread locks are a real serious
pain.  What has me slightly baffled is that f->inkfreelist_mutex appears to be local to
each freelist...  they are NOT global locks, and yet it slows down to a snail's pace.
>>  
>> Cameron says it's time to have some other brains look at this one.  That's probably
a good judgement call, because I am pretty wired.  Hard to think straight this Friday afternoon.
>>  
>> ----------------------------------
>>  
>> How to fix this?
>>  
>> Well, tcmalloc has an interesting approach...
>>  
>> Gets me thinking anyway....
>>  
>> Replacing the freepools outright with tcmalloc doesn't seem like an optimal solution.  It might wind up with a memory fragmentation problem...  Also, the current code has been debugged and is running using the freepool's memcpy of a prototype object in place of a constructor.  Given how much code there is in ATS, I would not want to force every class that uses the freepools to suddenly call a constructor for every object...  There is no telling what that could break if there are bugs in any of the previously unexercised constructors.
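>> (The prototype-copy initialization in question, reduced to a sketch -- hypothetical harness, not ATS code.  It only works for types the code treats as trivially copyable, which is exactly why switching to real constructors could surface latent bugs:)

```cpp
#include <cassert>
#include <cstring>

// Sketch of what ClassAllocator<C>::alloc() does, as seen in the
// preprocessor output above: new objects are bit-copied from a pristine
// prototype instead of having their constructor run.
template <class C>
struct ProtoPool {
  C prototype{};  // constructed once; every alloc is a byte copy of this

  C *alloc(void *raw) {
    std::memcpy(raw, &prototype, sizeof(C));  // no constructor invocation
    return static_cast<C *>(raw);
  }
};
```

>> This is well-defined only for trivially copyable C; classes with vtables or owning members would need the real constructor path.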
>>  
>> So I am thinking of per-thread free pools that share a global pool.  In that kind of design, if the per-thread free pool has anything available, the memory allocator doesn't need to lock.  It only needs to lock when grabbing a batch of, say, 100 items from the global shared pool, or when returning excessive numbers of freed items back to the global free pool.  Every thread has to be able to return items into its own free pool no matter which thread allocated them, so I am initially imagining that a global pool with locking batch operators feeding thread pools would be the best design.... though there could be better.
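>> (That per-thread/global-pool idea, sketched in code -- all names, the batch size, and the watermark are illustrative assumptions:)

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Global pool shared by all threads; locked only for batch transfers.
struct GlobalPool {
  std::mutex lock;
  std::vector<void *> items;
};

// One of these per thread; the common alloc/free paths take no lock.
struct ThreadPool {
  static constexpr size_t kBatch = 100;      // batch size from the proposal
  static constexpr size_t kHighWater = 200;  // return threshold (assumed)
  std::vector<void *> cache;
  GlobalPool *global = nullptr;

  void *alloc() {
    if (cache.empty()) {  // lock ONLY on refill
      std::lock_guard<std::mutex> g(global->lock);
      size_t n = global->items.size() < kBatch ? global->items.size() : kBatch;
      cache.insert(cache.end(), global->items.end() - n, global->items.end());
      global->items.resize(global->items.size() - n);
    }
    if (cache.empty()) return nullptr;  // real code would malloc a chunk here
    void *p = cache.back();
    cache.pop_back();
    return p;
  }

  void free(void *p) {
    cache.push_back(p);  // any thread frees into its OWN pool, no lock
    if (cache.size() > kHighWater) {  // lock ONLY when over the watermark
      std::lock_guard<std::mutex> g(global->lock);
      global->items.insert(global->items.end(), cache.end() - kBatch, cache.end());
      cache.resize(cache.size() - kBatch);
    }
  }
};
```

>> The design goal is that the mutex is touched once per hundred operations instead of once per operation, which is also roughly how tcmalloc's thread caches amortize their central-list lock.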
>>  
>> One complicated option is having a dedicated thread, triggered by watermarks detected by each thread that owns a pool, that would push or pull batches of thread pool objects out of the global pools.  That is kind of complicated, but it would keep the thread pools full all the time with never a locking delay.  Assuming I do it right, of course.
>>  
>> Ideas?
>>  
>> ---------------------------------
>>  
>> A couple of years ago, I worked for 21 months at Microsoft's bing.com UX Platform Team (since renamed the "Front Door Team", I have been told), on their front-line server team that handled all the search traffic for all of bing.com.  I am a grey-haired C++ server performance and stability WIZARD (for lack of a better term) with industry experience going back into the 1980's.  I am probably just what the ATS team needs as a committer…
>>  
>> The "-ND" in my name field, means "Not a Disney employee".  I am a contractor here,
with career aspirations.
>>  
>> I have heard from numerous sources that the ATS committer team members are regarded as celebrities in the field, especially if they hold onto that distinction for any length of time.  Someone who gets on the ATS committer team just does not wind up on unemployment....  Disney management (who seem to like me.... and I just love this place and the people here...) would absolutely NEVER even think of letting me go.
>>  
>> Some say that the best way to go is to show up with a huge bang and snow your doors
in.
>>  
>> One of the big motivators to take this contract position was something Cameron said:
"we need someone who knows C++ to learn ATS inside and out, which might provide the opportunity
for open source contributions and making a name for yourself".
>>  
>> So... Umm.. HI THERE!
>>  
>>  
> 
