Hi James, 
Did you ever open a ticket in Jira for this so it can be tracked?


On Fri, Mar 9, 2012 at 4:59 PM, Whitaker, James X. -ND <James.X.Whitaker.-ND@disney.com> wrote:
Yes….  It is a pretty interesting find.  But WAY too much tension has built up around it.

From: Jason Giedymin <jason.giedymin@gmail.com>
Reply-To: "users@trafficserver.apache.org" <users@trafficserver.apache.org>
Date: Fri, 9 Mar 2012 16:57:35 -0800
To: "users@trafficserver.apache.org" <users@trafficserver.apache.org>
Subject: Re: Look Mom, no locks!

Sorry, was using this as an example of an exciting report vs a dull one at work.


On Mar 9, 2012, at 7:55 PM, Jason Giedymin <jason.giedymin@gmail.com> wrote:

Came in last week, check out this awesome bug report. 


On Mar 2, 2012, at 10:29 PM, "Whitaker, James X. -ND" <James.X.Whitaker.-ND@disney.com> wrote:

I imagine that it isprobably humanly impossible to see THIS bug with code inspection.  Or even by stepping through it in a debugger.


I must have scoured over this code about a dozen times looking for problems, and I missed it.  I had to utterly pulverize ATS source code with instrumentation generating ruby scripts to zero in on it.  What a mess I made.


On a RedHat or CentOS box, go into lib/ts, edit the Makefile changing the gcc optimizer flags to –E, run make in that lib/ts directory and look at the preprocessor output in the files ink_queue.o and Allocator.o


You will have to stare, and stare, and stare before you see it.  And you gotta have the right kind of brain.




f->head can get grabbed by simultaneous threads/cores that call ink_freelist_new() at the same instant in time -- which will grab the same memory location on each thread.


The remaining code with all those atomic operators neglects to couple the assignment of item=f->head to the increments of head within a single lock.


Allocator.o shows that there are no locks.


The more cores used by ATS, the more unstable ATS will be.




I would imagine that the best of the best of the world’s developers have been chasing the truly baffling side effects of this bug down for years, and probably just plain giving up.  There are all sorts of odd mitigations in the code.  Lots of mystery crash reports.  I have to wonder if it is all due to this.




Credit for this bug find goes to the Foundational Technologies Team at Disney, Cameron Stevens, my supervisor who pushed me (and paid me) and cleared my to do list to make room to find this thing, of course moi who spent the last several weeks wracking my brains coming up with one (sometimes absurd) hypothesis after another and one debugging and code scanning technique after another trying to zero in on it...  and I suppose Lisa providing the incentive of stressing the need for financial stability before we get married... a serious incentive to make a name for myself.




I created some ruby scripts to imbed instrumentation in the source code (and lots of hand cleanup to make it compile... sigh).


Among other hacks:

   each continuation is assigned a unique ID (jimw_stamp) when it is created

   each InkContInternal::init call creates another unique ID

       (because the freepool allocator does a memcopy without constructor invocation).

   each time an event was scheduled, the event inherits the jimw_stamp from the continuation of "origin"

   every schedule_* macro invocation and TSContSchedule and a few others kept track of __FILE__ and __LINE__ that called it, logging that in new member variables of the event object.

   the configure script was altered by replacing "-O2" with "-O0 -ggdb3" for the release version


When it crashes, thedebugger shows me:

    the Event::jimw_stamp taken from the continuation whose associated instruction stream originated the event

    the InkContInternal::jimw_stamp of the receiving continuation

    The line and file number in the source code that originated the event that triggered the crash


The core dump showed an event generated for a continuation that had been created by TSContCreate in(lets call it) path A through the code. The event handler that received the event was the code for handling a completely different continuation, path B.


Both Path A and Path B were based upon continuations that were created with TSContCreate.  Usually, but not always, the handler function whose code triggered the crash would be code that was for a continuation that was a "helper" continuation intended to support the HttSM class or other similar classes.  This Path B code was utterly unprepared to handle the events from Path A, would do something freaky, and quite naturally trigger an assert.


It took lots of convoluted setup to allow RHEL to generate core dumps of setuid programs...


The big boogeyman, HOWEVER, is that both the incoming EVENT and the receiving mismatched CONTINUATION had the same "jimw_stamp" tag in them.




Think about that for a moment.


An event that had been assigned the unique tag from the continuation's allocated by TSContCreate in Path A is received by the code for another contination created by TSContCreate in Path B.  They share the same guaranteed-by-the-constructor-to-be-unique ID tag.


The only explanation for that possible is that two continuations somehow had the same jimw_stamp!  The same unique tag.  Physically impossible.  That means that TWO LIVE CONTINUATIONS had to be occupying the SAME MEMORY SLOT IN THE FREELIST.  One of the continuations is receiving all the events for both of them!


"YOU HAVE GOT TO BE KIDDING ME!"  I said out loud.  Banged my ankle on something really hard.  Still nursing the wound.


Um... that would be bad.  Real bad.  Unlikely, I thought.  That would mean a bug in a freepool allocator...  most likely a missing mutex....  I could not imagine that kind of thing ever getting through a code review guantlet.  Hundreds of developers have inspected THAT code, right?


Code inspection shows that there are locks and mutexes and spinlocks, atomic incriments, atomic cas.....  There is NO WAY that could fail right?  It looks like it should be bulletproof, but look at all those preprocessor directives!


Well, ink_freelist_new isnot the easiest code to figure out.  Take a look at it....  mind boggling....  After an hour or more I just plain gave up trying to figure out what it is ACTUALLY DOING.  I went digging through gcc manuals and figured out how to turn on pre-processor output.  If this code, so full of preprocessor directives, and so full of locks is putting two thingsinto the same memory location: WHAT ON EARTH IS ACTUALLY GETTING COMPILED?  I edited the Makefile and put in -E in place of the optimizer flags, and let the preprocessor do it's thing.  What on earth does the actual C++ code thatis actually getting compiled in look like?




ink_queue.o, preprocessor output



void *

ink_freelist_new(InkFreeList * f)



# 234 "ink_queue.cc"

  head_p item;

  head_p next;

  int result = 0;





  do {

    *((uint64_t*)&(item)) = *((uint64_t*)&(f->head));

    if (((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) == __null) {

      uint32_t type_size = f->type_size;

      uint32_t i;

# 255 "ink_queue.cc"

      void *newp = __null;





      if (f->alignment)

        newp = ink_memalign(f->alignment, f->chunk_size * type_size);


        newp = ink_malloc(f->chunk_size * type_size);

      if (newp)

        ink_atomic_increment64(&freelist_allocated_mem, (int64_t) (f->chunk_size * type_size));;







      (item).data = ((((intptr_t)(newp))&0x0000FFFFFFFFFFFFULL) | (((0)&0xFFFFULL) << 48));

# 291 "ink_queue.cc"

      ink_atomic_increment((int *) &f->allocated, f->chunk_size);

      ink_atomic_increment64(&fastalloc_mem_total, (int64_t) f->chunk_size * f->type_size);







      for (i = 0; i < f->chunk_size; i++) {

        char *a = ((char *) ((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))) + i * type_size;






        ink_freelist_free(f, a);

# 315 "ink_queue.cc"



      ink_atomic_increment((int *) &f->count, f->chunk_size);

      ink_atomic_increment64(&fastalloc_mem_in_use, (int64_t) f->chunk_size * f->type_size);






    } else {

      (next).data = ((((intptr_t)(*((void **)((char *)((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) + f->offset))))&0x0000FFFFFFFFFFFFULL) | ((((((intptr_t)(item).data)>>48) + 1)&0xFFFFULL) << 48));



      result = ink_atomic_cas64((int64_t *) & f->head.data, item.data, next.data);

# 345 "ink_queue.cc"



  while (result == 0);

  (void)(!((uintptr_t)((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))))&(((uintptr_t)f->alignment)-1)));


  ink_atomic_increment((int *) &f->count, 1);

  ink_atomic_increment64(&fastalloc_mem_in_use, (int64_t) f->type_size);


  return ((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))));





Allocator.o section:


template<class C> inline C * ClassAllocator<C>::alloc()





  void *ptr = ink_freelist_new(&this->fl);


  if (sizeof(C) < 512) {

    for (unsigned int i = 0; i < (((sizeof(C))+15)&~15) / sizeof(int64_t); i++)

      ((int64_t *) ptr)[i] = ((int64_t *) &this->proto.typeObject)[i];

  } else

    memcpy(ptr, &this->proto.typeObject, sizeof(C));

  return (C *) ptr;





static ClassAllocator<INKContInternal> INKContAllocator("INKContAllocator");


later in InkAPI.cc




// Continuations





TSContCreate(TSEventFunc funcp, TSMutex mutexp)


  // mutexp can be NULL

  if (mutexp != NULL)

    sdk_assert(sdk_sanity_check_mutex(mutexp) == TS_SUCCESS);


  INKContInternal *i = INKContAllocator.alloc();


  i->init(funcp, mutexp);

// printf ("CREATINGCONT: %p  marker: %d\n", i, i->jimw_stamp);

  return (TSCont)i;








Now, my guess is that for who knows how many years, anyone reviewing the code saw this in ink_queue.h and assumed the locks were in place.


INK_USE_MUTEX_FOR_FREELISTS is not in actuality defined, it is only defined when IS_TS_MICRO_BUILD (??!?) is 1, which it is NOT.


If you do a "grep", it shows the line where INK_USE_MUTEX_FOR_FREELISTS is defined.



162   inkcoreapi void *ink_freelist_new(InkFreeList * f);

163   inkcoreapi void ink_freelist_free(InkFreeList * f, void *item);

164 #else                           /* INK_USE_MUTEX_FOR_FREELISTS */

165   inkcoreapi void *ink_freelist_new_wrap(InkFreeList * f);

166   static inline void *ink_freelist_new(InkFreeList * f)

167   {

168     void *retval = NULL;


170     ink_mutex_acquire(&(f->inkfreelist_mutex));

171     retval = ink_freelist_new_wrap(f);

172     ink_mutex_release(&(f->inkfreelist_mutex));

173     return retval;

174   }


176   inkcoreapi void ink_freelist_free_wrap(InkFreeList * f, void *item);

177   static inline void ink_freelist_free(InkFreeList * f, void *item)

178   {

179     ink_mutex_acquire(&(f->inkfreelist_mutex));

180     ink_freelist_free_wrap(f, item);

181     ink_mutex_release(&(f->inkfreelist_mutex));

182   }



If you turn it on bydefining INK_USE_MUTEX_FOR_FREELISTS at the top of ink_queue.h, the code does NOT compile.  This suggests a VERY old bug.  Once you get it to compile by adding the missing header file, it deadlocks on itself in line 306 of ink_queue.ccwhere it calls ink_freelist_free instead of ink_freelist_free_wrap....  another sign of a very old bug....  Well...  Fix that...  fire it up...


Stable...  REAL STABLE  It has been running for over twenty hours now hammered by my ruby script load generator with no trace of errors where it used to barf and crash in seconds or minutes, or occasionally an hour.


But I can not get it toshow over 56% of CPU with "top" (15-20% max on any given core), not even with 100 load generators pounding on ATS.  Before it didn't take much of anything to make it max out all four virtual cores.  This is, of course, on my Mac VMWare Fusion with the version from 2009...  the inexplicable delays could be some kind of queuing glitch from what I suspect is the probable absence of a monotonically increasing possibly double buffered clock on my VM/Linux Kernel.  So I tried it with the 3.0.3 version with Brian's queuing fix, and it does thesame thing, with query times starting at around 30ms instead of 3 milliseconds...


Turn on the locks, and I get a 5 to 10X performance hit!


I am not sure yet what is up with the degraded performance turning on these mutex locks...  I would expect a hit because of locking delays... but this is REALLY severe... This is why MALLOC in a multi-core box is a challenge, because threadlocks are a real serious pain.  What has me slightly baffled is that f->inkfreelist_mutex appears to be local to each freelist...  they are NOT global locks, and yet it slows down to a snail's pace.


Cameron says it's time to have some other brains look at this one.  That's probably a good judgement call, because I am pretty wired.  Hard to think straight this Friday afternoon.




How to fix this?


Well, tcmalloc has an interesting approach...


Gets me thinking anyway....


Replacing the freepools outright with tcmalloc doesn't seem like an optimal solution.  Might wind up with a memory fragmentation problem...  Also the current code has been debugged and is running using the freepool's memcpy of a prototype object in place of a constructor.  Given how much code there is in ATS, I would not want to force every class that uses the freepools to suddenly call a constructor for every object...  There is no telling what that could break if there are bugs in any of the previously unexercised constructors.


So I am thinking to have per thread free pools, that share a global pool.  In that kind of design, if the per thread free pool has anything available, the memory allocator doesn't need to lock.  It only needs to lock when grabbing a batch of, say 100 items, from the global shared pool, or when returning excessivenumbers of freed items back to the global free pool.  Every thread has to be able to return items back into it's own free pool no matter which thread allocated it, so I am initially imagining that having a global pool with locking batch operators feeding thread pools would be the best design.... though there could be better.


One complicated option is having a dedicated thread that was triggered by watermarks detected by eachthread that owned a pool that would push or pull batches of thread pool objects out of the global pools.  That is kind of complicated, but it would keep the thread pools full all the time with never a locking delay.  Assuming I do it right, of course.






A couple of years ago, I worked for 21 months at Microsoft's bing.com's UX Platform Team (since renamed the "Front Door Team" I have been told), on their front line server team that handled all the search traffic for all of bing.com.  I am a grey haired C++ server performance andstability WIZARD (for lack of a better term) with industry experience goingback into the 1980's.  I am probably just what the ATS team needs as a committer…


The "-ND" in my name field, means "Not a Disney employee".  I am a contractor here, with career aspirations.


I have heard from numerous sources that the ATS committer team members are regarded as celebrities in the field, especially if they hold onto that distinction for any length of time.  That someone who gets on the ATS committer team just does not wind up on unemployment....  That Disney management (who seem to like me.... and I just love this place and the people here...) would absolutely NEVER even think of letting me go.


Some say that the best way to go is to show up with a huge bang and snow your doors in.


One of the big motivators to take this contract position was something Cameron said: "we need someone who knows C++ to learn ATS inside and out, which might provide the opportunity for open source contributions and making a name for yourself".


So... Umm.. HI THERE!