trafficserver-users mailing list archives

From "Whitaker, James X. -ND" <James.X.Whitaker....@disney.com>
Subject Re: Look Mom, no locks!
Date Sat, 10 Mar 2012 00:59:56 GMT
Yes….  It is a pretty interesting find.  But WAY too much tension has built up around it.

From: Jason Giedymin <jason.giedymin@gmail.com>
Reply-To: "users@trafficserver.apache.org" <users@trafficserver.apache.org>
Date: Fri, 9 Mar 2012 16:57:35 -0800
To: "users@trafficserver.apache.org" <users@trafficserver.apache.org>
Subject: Re: Look Mom, no locks!

Sorry, was using this as an example of an exciting report vs a dull one at work.

-Jason

On Mar 9, 2012, at 7:55 PM, Jason Giedymin <jason.giedymin@gmail.com> wrote:

Came in last week, check out this awesome bug report.

-Jason

On Mar 2, 2012, at 10:29 PM, "Whitaker, James X. -ND" <James.X.Whitaker.-ND@disney.com> wrote:

I imagine that it is probably humanly impossible to see THIS bug with code inspection.  Or
even by stepping through it in a debugger.

I must have pored over this code about a dozen times looking for problems, and I missed
it.  I had to utterly pulverize the ATS source code with instrumentation-generating Ruby scripts
to zero in on it.  What a mess I made.

On a Red Hat or CentOS box, go into lib/ts, edit the Makefile to change the gcc optimizer flags
to -E, run make in that lib/ts directory, and look at the preprocessor output in the files
ink_queue.o and Allocator.o.

You will have to stare, and stare, and stare before you see it.  And you gotta have the right
kind of brain.

ink_freelist_new()

f->head can get grabbed by simultaneous threads/cores that call ink_freelist_new() at the
same instant in time -- which will grab the same memory location on each thread.

The remaining code, with all those atomic operators, neglects to couple the assignment of item = f->head
to the increments of head within a single lock.

Allocator.o shows that there are no locks.

The more cores used by ATS, the more unstable ATS will be.
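The coupling problem can be sketched in a few lines (this is an illustration of the failure mode, not ATS code):

```cpp
#include <atomic>

struct Node { Node* next; };

// Broken pop: reading head and updating head are two separate steps.
// Two threads can both read the same head before either writes, and
// both walk away holding the same node.
Node* pop_uncoupled(Node** head) {
    Node* item = *head;            // thread A and thread B can both read here...
    if (item) *head = item->next;  // ...then both write; one node is handed out twice
    return item;
}

// Coupled pop: the compare-and-swap only succeeds if head still holds
// the value we read, so the read and the update act as one atomic step.
Node* pop_cas(std::atomic<Node*>* head) {
    Node* item = head->load();
    while (item != nullptr &&
           !head->compare_exchange_weak(item, item->next)) {
        // CAS failed: another thread won the race; item was reloaded, retry
    }
    return item;
}
```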

---

I would imagine that the best of the best of the world’s developers have been chasing the
truly baffling side effects of this bug down for years, and probably just plain giving up.
 There are all sorts of odd mitigations in the code.  Lots of mystery crash reports.  I have
to wonder if it is all due to this.

---

Credit for this bug find goes to the Foundational Technologies Team at Disney; Cameron Stevens,
my supervisor, who pushed me (and paid me) and cleared my to-do list to make room to find this
thing; of course moi, who spent the last several weeks wracking my brains, coming up with one
(sometimes absurd) hypothesis after another and one debugging and code-scanning technique
after another trying to zero in on it...  and I suppose Lisa, providing the incentive by stressing
the need for financial stability before we get married... a serious incentive to make a name
for myself.

---

I created some Ruby scripts to embed instrumentation in the source code (and lots of hand
cleanup to make it compile... sigh).

Among other hacks:
   each continuation is assigned a unique ID (jimw_stamp) when it is created
   each InkContInternal::init call creates another unique ID
       (because the freepool allocator does a memcpy without constructor invocation).
   each time an event was scheduled, the event inherits the jimw_stamp from the continuation
of "origin"
   every schedule_* macro invocation and TSContSchedule and a few others kept track of __FILE__
and __LINE__ that called it, logging that in new member variables of the event object.
   the configure script was altered by replacing "-O2" with "-O0 -ggdb3" for the release version
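A hedged reconstruction of the stamping scheme above (member and macro names guessed from the report, not the actual patch): a process-wide atomic counter hands out unique IDs, and every scheduled event inherits the stamp of its continuation of origin, plus the call site.

```cpp
#include <atomic>
#include <cstdint>

// Process-wide counter: every continuation gets a unique jimw_stamp.
static std::atomic<uint64_t> next_jimw_stamp{1};

struct StampedCont {
    uint64_t jimw_stamp;
    StampedCont() : jimw_stamp(next_jimw_stamp.fetch_add(1)) {}
};

struct StampedEvent {
    uint64_t jimw_stamp = 0;  // inherited from the originating continuation
    const char* file = "";    // __FILE__ at the schedule_* call site
    int line = 0;             // __LINE__ at the schedule_* call site
};

// What a traced schedule_* wrapper might record (hypothetical macro):
#define SCHEDULE_TRACED(ev, cont)                    \
    do {                                             \
        (ev).jimw_stamp = (cont).jimw_stamp;         \
        (ev).file = __FILE__;                        \
        (ev).line = __LINE__;                        \
    } while (0)
```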

When it crashes, the debugger shows me:
    the Event::jimw_stamp taken from the continuation whose associated instruction stream
originated the event
    the InkContInternal::jimw_stamp of the receiving continuation
    The line and file number in the source code that originated the event that triggered the
crash

The core dump showed an event generated for a continuation that had been created by TSContCreate
in (let's call it) path A through the code.  The event handler that received the event was the
code for handling a completely different continuation, path B.

Both Path A and Path B were based upon continuations that were created with TSContCreate.
 Usually, but not always, the handler function whose code triggered the crash would be code
for a continuation that was a "helper" continuation intended to support the HttpSM
class or other similar classes.  This Path B code was utterly unprepared to handle the events
from Path A, would do something freaky, and quite naturally trigger an assert.

It took lots of convoluted setup to allow RHEL to generate core dumps of setuid programs...

The big boogeyman, HOWEVER, is that both the incoming EVENT and the receiving mismatched CONTINUATION
had the same "jimw_stamp" tag in them.

WOAH!

Think about that for a moment.

An event that had been assigned the unique tag from the continuation allocated by TSContCreate
in Path A is received by the code for another continuation created by TSContCreate in Path
B.  They share the same guaranteed-by-the-constructor-to-be-unique ID tag.

The only possible explanation is that two continuations somehow had the same jimw_stamp!
 The same unique tag.  Physically impossible.  That means that TWO LIVE CONTINUATIONS had
to be occupying the SAME MEMORY SLOT IN THE FREELIST.  One of the continuations is receiving
all the events for both of them!

"YOU HAVE GOT TO BE KIDDING ME!"  I said out loud.  Banged my ankle on something really hard.
 Still nursing the wound.

Um... that would be bad.  Real bad.  Unlikely, I thought.  That would mean a bug in a freepool
allocator...  most likely a missing mutex...  I could not imagine that kind of thing ever
getting through a code review gauntlet.  Hundreds of developers have inspected THAT code,
right?

Code inspection shows that there are locks and mutexes and spinlocks, atomic increments, atomic
cas...  There is NO WAY that could fail, right?  It looks like it should be bulletproof,
but look at all those preprocessor directives!

Well, ink_freelist_new is not the easiest code to figure out.  Take a look at it...  mind
boggling...  After an hour or more I just plain gave up trying to figure out what it is ACTUALLY
DOING.  I went digging through gcc manuals and figured out how to turn on preprocessor output.
 If this code, so full of preprocessor directives, and so full of locks, is putting two things into
the same memory location: WHAT ON EARTH IS ACTUALLY GETTING COMPILED?  I edited the Makefile
and put in -E in place of the optimizer flags, and let the preprocessor do its thing.  What
on earth does the actual C++ code that is actually getting compiled look like?

---

ink_queue.o, preprocessor output


void *
ink_freelist_new(InkFreeList * f)

{
# 234 "ink_queue.cc"
  head_p item;
  head_p next;
  int result = 0;




  do {
    *((uint64_t*)&(item)) = *((uint64_t*)&(f->head));
    if (((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) == __null) {
      uint32_t type_size = f->type_size;
      uint32_t i;
# 255 "ink_queue.cc"
      void *newp = __null;




      if (f->alignment)
        newp = ink_memalign(f->alignment, f->chunk_size * type_size);
      else
        newp = ink_malloc(f->chunk_size * type_size);
      if (newp)
        ink_atomic_increment64(&freelist_allocated_mem, (int64_t) (f->chunk_size * type_size));;






      (item).data = ((((intptr_t)(newp))&0x0000FFFFFFFFFFFFULL) | (((0)&0xFFFFULL) << 48));
# 291 "ink_queue.cc"
      ink_atomic_increment((int *) &f->allocated, f->chunk_size);
      ink_atomic_increment64(&fastalloc_mem_total, (int64_t) f->chunk_size * f->type_size);






      for (i = 0; i < f->chunk_size; i++) {
        char *a = ((char *) ((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))) + i * type_size;





        ink_freelist_free(f, a);
# 315 "ink_queue.cc"
      }

      ink_atomic_increment((int *) &f->count, f->chunk_size);
      ink_atomic_increment64(&fastalloc_mem_in_use, (int64_t) f->chunk_size * f->type_size);





    } else {
      (next).data = ((((intptr_t)(*((void **)((char *)((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) + f->offset))))&0x0000FFFFFFFFFFFFULL) | ((((((intptr_t)(item).data)>>48) + 1)&0xFFFFULL) << 48));


      result = ink_atomic_cas64((int64_t *) & f->head.data, item.data, next.data);
# 345 "ink_queue.cc"
    }
  }
  while (result == 0);
  (void)(!((uintptr_t)((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))))&(((uintptr_t)f->alignment)-1)));

  ink_atomic_increment((int *) &f->count, 1);
  ink_atomic_increment64(&fastalloc_mem_in_use, (int64_t) f->type_size);

  return ((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) | (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))));

}
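The shift-and-mask soup in the dump above appears to pack a 48-bit pointer and a 16-bit version counter into one 64-bit word (the version guards the CAS against ABA reuse of a slot).  A hedged reconstruction of what those expressions reduce to:

```cpp
#include <cstdint>

// Pack a pointer into the low 48 bits and a version tag into the high 16,
// matching the 0x0000FFFFFFFFFFFFULL mask and the << 48 in the expansion.
uint64_t pack(void* ptr, uint16_t version) {
    return (reinterpret_cast<uint64_t>(ptr) & 0x0000FFFFFFFFFFFFULL)
         | (static_cast<uint64_t>(version) << 48);
}

void* unpack_ptr(uint64_t data) {
    // Shift left then arithmetic-shift right to sign-extend bit 47,
    // matching the ((intptr_t)(item).data << 16) >> 16 idiom above.
    return reinterpret_cast<void*>(static_cast<int64_t>(data << 16) >> 16);
}

uint16_t unpack_version(uint64_t data) {
    return static_cast<uint16_t>(data >> 48);
}
```

The CAS in the else-branch bumps the version (`+ 1` in the high 16 bits) on every successful pop, so a stale head value fails the compare even if the same pointer comes back.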


Allocator.o section:

template<class C> inline C * ClassAllocator<C>::alloc()
{



  void *ptr = ink_freelist_new(&this->fl);

  if (sizeof(C) < 512) {
    for (unsigned int i = 0; i < (((sizeof(C))+15)&~15) / sizeof(int64_t); i++)
      ((int64_t *) ptr)[i] = ((int64_t *) &this->proto.typeObject)[i];
  } else
    memcpy(ptr, &this->proto.typeObject, sizeof(C));
  return (C *) ptr;
}

InkAPI.cc:

static ClassAllocator<INKContInternal> INKContAllocator("INKContAllocator");

later in InkAPI.cc

////////////////////////////////////////////////////////////////////
//
// Continuations
//
////////////////////////////////////////////////////////////////////

TSCont
TSContCreate(TSEventFunc funcp, TSMutex mutexp)
{
  // mutexp can be NULL
  if (mutexp != NULL)
    sdk_assert(sdk_sanity_check_mutex(mutexp) == TS_SUCCESS);

  INKContInternal *i = INKContAllocator.alloc();

  i->init(funcp, mutexp);
// printf ("CREATING CONT: %p  marker: %d\n", i, i->jimw_stamp);
  return (TSCont)i;
}


LOOK MOM!  NO LOCKS!!!!!

---------------------------------

Now, my guess is that for who knows how many years, anyone reviewing the code saw this in
ink_queue.h and assumed the locks were in place.

INK_USE_MUTEX_FOR_FREELISTS is not in actuality defined; it is only defined when IS_TS_MICRO_BUILD
(??!?) is 1, which it is NOT.

If you do a "grep", it shows the line where INK_USE_MUTEX_FOR_FREELISTS is defined.

161 #if !defined(INK_USE_MUTEX_FOR_FREELISTS)
162   inkcoreapi void *ink_freelist_new(InkFreeList * f);
163   inkcoreapi void ink_freelist_free(InkFreeList * f, void *item);
164 #else                           /* INK_USE_MUTEX_FOR_FREELISTS */
165   inkcoreapi void *ink_freelist_new_wrap(InkFreeList * f);
166   static inline void *ink_freelist_new(InkFreeList * f)
167   {
168     void *retval = NULL;
169
170     ink_mutex_acquire(&(f->inkfreelist_mutex));
171     retval = ink_freelist_new_wrap(f);
172     ink_mutex_release(&(f->inkfreelist_mutex));
173     return retval;
174   }
175
176   inkcoreapi void ink_freelist_free_wrap(InkFreeList * f, void *item);
177   static inline void ink_freelist_free(InkFreeList * f, void *item)
178   {
179     ink_mutex_acquire(&(f->inkfreelist_mutex));
180     ink_freelist_free_wrap(f, item);
181     ink_mutex_release(&(f->inkfreelist_mutex));
182   }
183 #endif /* INK_USE_MUTEX_FOR_FREELISTS */

If you turn it on by defining INK_USE_MUTEX_FOR_FREELISTS at the top of ink_queue.h, the code
does NOT compile.  This suggests a VERY old bug.  Once you get it to compile by adding the
missing header file, it deadlocks on itself in line 306 of ink_queue.cc, where it calls ink_freelist_free
instead of ink_freelist_free_wrap...  another sign of a very old bug...  Well...  Fix that...
 fire it up...
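That deadlock has the classic non-recursive re-entry shape: the wrapped allocator runs with f->inkfreelist_mutex held and then calls a function that tries to take the same mutex again.  Sketched here with a recursive mutex so the re-entry can actually run (with std::mutex, the inner call would hang forever):

```cpp
#include <mutex>

// Stand-in for f->inkfreelist_mutex; recursive only so this demo returns.
static std::recursive_mutex pool_mutex;

int inner_free() {
    std::lock_guard<std::recursive_mutex> g(pool_mutex);  // second acquire
    return 1;
}

int outer_alloc() {
    std::lock_guard<std::recursive_mutex> g(pool_mutex);  // first acquire
    return inner_free();  // with a non-recursive std::mutex, never returns
}
```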

Stable...  REAL STABLE.  It has been running for over twenty hours now, hammered by my Ruby
script load generator, with no trace of errors where it used to barf and crash in seconds or
minutes, or occasionally an hour.

But I can not get it to show over 56% of CPU with "top" (15-20% max on any given core), not
even with 100 load generators pounding on ATS.  Before, it didn't take much of anything to
make it max out all four virtual cores.  This is, of course, on my Mac VMware Fusion with
the version from 2009...  the inexplicable delays could be some kind of queuing glitch from
what I suspect is the probable absence of a monotonically increasing, possibly double buffered
clock on my VM/Linux kernel.  So I tried it with the 3.0.3 version with Brian's queuing fix,
and it does the same thing, with query times starting at around 30 ms instead of 3 ms...

Turn on the locks, and I get a 5 to 10X performance hit!

I am not sure yet what is up with the degraded performance from turning on these mutex locks...
 I would expect a hit because of locking delays... but this is REALLY severe...  This is why
malloc in a multi-core box is a challenge: thread locks are a real serious pain.  What
has me slightly baffled is that f->inkfreelist_mutex appears to be local to each freelist...
 they are NOT global locks, and yet it slows down to a snail's pace.

Cameron says it's time to have some other brains look at this one.  That's probably a good
judgement call, because I am pretty wired.  Hard to think straight this Friday afternoon.

----------------------------------

How to fix this?

Well, tcmalloc has an interesting approach...

Gets me thinking anyway....

Replacing the freepools outright with tcmalloc doesn't seem like an optimal solution.  Might
wind up with a memory fragmentation problem...  Also the current code has been debugged and
is running using the freepool's memcpy of a prototype object in place of a constructor.  Given
how much code there is in ATS, I would not want to force every class that uses the freepools
to suddenly call a constructor for every object...  There is no telling what that could break
if there are bugs in any of the previously unexercised constructors.

So I am thinking to have per-thread free pools that share a global pool.  In that kind of
design, if the per-thread free pool has anything available, the memory allocator doesn't need
to lock.  It only needs to lock when grabbing a batch of, say, 100 items from the global shared
pool, or when returning excessive numbers of freed items back to the global free pool.  Every
thread has to be able to return items back into its own free pool no matter which thread
allocated them, so I am initially imagining that having a global pool with locking batch operators
feeding thread pools would be the best design... though there could be better.

One complicated option is having a dedicated thread, triggered by watermarks detected by
each thread that owned a pool, that would push or pull batches of thread pool objects out
of the global pools.  That is kind of complicated, but it would keep the thread pools full
all the time with never a locking delay.  Assuming I do it right, of course.
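The first design could be sketched like this (my naming, a proposal rather than ATS code; the drain-back-to-global path and error handling are elided):

```cpp
#include <mutex>
#include <vector>
#include <cstdlib>

// Global pool: its mutex is only taken for batch moves, never per-object.
struct GlobalPool {
    std::mutex mtx;
    std::vector<void*> items;
    size_t obj_size;

    explicit GlobalPool(size_t sz) : obj_size(sz) {}

    // The only locked operation: hand a batch to a thread cache,
    // growing the pool with malloc if it runs dry.
    void refill(std::vector<void*>& local, size_t batch) {
        std::lock_guard<std::mutex> g(mtx);
        while (items.size() < batch)
            items.push_back(std::malloc(obj_size));
        for (size_t i = 0; i < batch; ++i) {
            local.push_back(items.back());
            items.pop_back();
        }
    }
};

// One of these per thread: the hot path touches no lock at all.
struct ThreadCache {
    GlobalPool* global;
    std::vector<void*> cache;

    void* alloc() {
        if (cache.empty())
            global->refill(cache, 100);  // batch of ~100, as proposed
        void* p = cache.back();
        cache.pop_back();
        return p;
    }

    // Returning excess items to the global pool past a watermark is elided.
    void dealloc(void* p) { cache.push_back(p); }
};
```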

Ideas?

---------------------------------

A couple of years ago, I worked for 21 months at Microsoft's bing.com
UX Platform Team (since renamed the "Front Door Team", I have been told), on their front line
server team that handled all the search traffic for all of bing.com.
 I am a grey-haired C++ server performance and stability WIZARD (for lack of a better term)
with industry experience going back into the 1980's.  I am probably just what the ATS team
needs as a committer…

The "-ND" in my name field, means "Not a Disney employee".  I am a contractor here, with career
aspirations.

I have heard from numerous sources that the ATS committer team members are regarded as celebrities
in the field, especially if they hold onto that distinction for any length of time.  That
someone who gets on the ATS committer team just does not wind up on unemployment....  That
Disney management (who seem to like me.... and I just love this place and the people here...)
would absolutely NEVER even think of letting me go.

Some say that the best way to go is to show up with a huge bang and snow your doors in.

One of the big motivators to take this contract position was something Cameron said: "we need
someone who knows C++ to learn ATS inside and out, which might provide the opportunity for
open source contributions and making a name for yourself".

So... Umm.. HI THERE!


