trafficserver-users mailing list archives

From John Plevyak <jplev...@gmail.com>
Subject Re: Look Mom, no locks!
Date Sat, 03 Mar 2012 17:06:12 GMT
On review: according to the gcc documentation, ink_atomic_cas64, which is
a __sync_bool_compare_and_swap(), is a full memory barrier.  The memory
barrier therefore already exists, and it is unlikely (barring a bug in
gcc) that making INK_MEMORY_BARRIER an mfence will have any impact.

I think the best way forward is to try to get test_freelist to fail.

john

On Sat, Mar 3, 2012 at 8:25 AM, John Plevyak <jplevyak@gmail.com> wrote:

>
> James et al.
>
> There are no locks by design, so that, in-and-of-itself, is not a bug.
> That said, the code relies on subtle aspects of the memory consistency
> model of the system on which it is run.  Each processor architecture has
> subtle differences in memory consistency and in particular, the Intel chips
> have changed greatly over the years since that code was written.  It is
> quite possible that on some processor architectures the appropriate memory
> barriers are not being used to ensure correct behavior.   That said, having
> worked on this "bug" for many years I have found that most of the time the
> problem has been a double free in user code which
> corrupts the freelist.  That is not to say that this is the case here
> because of the aforementioned chip changes.
>
> The "use mutex" flag was added only for the "micro ts" which did not
> support 64-bit atomic operations and therefore could not
> use the atomic code path.  The "micro ts" was limited in performance for a
> variety of reasons so that the additional overhead
> of using a mutex was not an issue.
>
> I agree with your analysis that neither a mutex nor tcmalloc is the
> solution.  There are other memory allocators, and perhaps one of them
> could work.
>
> In the past, problems could be discovered by running lib/ts/test_freelist.
> This generates 64 threads which do nothing but bang on the freelist as
> hard as they can.  On my box (4 x Intel 2.66 core2) this runs fine.
>
> I would encourage you to run the test on machines which are showing
> instability and see if it works.  I would also encourage you to see if
> you can augment the test to get it to fail: no matter the solution,
> having a good test/micro-benchmark will be extremely valuable in
> discovering and fixing problems and in evaluating potential solutions.
>
> According to the Intel memory models, I am certain that the
> INK_*_MEMORY_BARRIER macros should not be empty on Intel: they should
> be sfence, lfence and mfence instructions, and an sfence
> (INK_WRITE_MEMORY_BARRIER, or the more conservative INK_MEMORY_BARRIER
> which is already there) should be issued at ink_queue.cc:533.  I will
> build a patch to fix the memory barriers.
>
> thanx,
> john
>
>
> On Fri, Mar 2, 2012 at 8:23 PM, Jason Giedymin <jason.giedymin@gmail.com> wrote:
>
>> Excuse my Lang, but freakin A dude.
>>
>> Bravo.
>>
>> -Jason
>>
>> On Mar 2, 2012, at 10:29 PM, "Whitaker, James X. -ND" <
>> James.X.Whitaker.-ND@disney.com> wrote:
>>
>> I imagine that it is probably humanly impossible to see THIS bug with
>> code inspection.  Or even by stepping through it in a debugger.
>>
>> I must have scoured this code about a dozen times looking for problems,
>> and I missed it.  I had to utterly pulverize the ATS source code with
>> instrumentation-generating ruby scripts to zero in on it.  What a mess
>> I made.
>>
>> On a RedHat or CentOS box, go into lib/ts, edit the Makefile changing
>> the gcc optimizer flags to -E, run make in that lib/ts directory, and
>> look at the preprocessor output in the files ink_queue.o and Allocator.o.
>>
>> You will have to stare, and stare, and stare before you see it.  And you
>> gotta have the right kind of brain.
>>
>> ink_freelist_new()
>>
>> f->head can get grabbed by simultaneous threads/cores that call
>> ink_freelist_new() at the same instant in time -- which will grab the
>> same memory location on each thread.
>>
>> The remaining code, with all those atomic operators, neglects to couple
>> the assignment of item = f->head to the increments of head within a
>> single lock.
>>
>> Allocator.o shows that there are no locks.
>>
>> The more cores used by ATS, the more unstable ATS will be.
>>
>> ---
>>
>> I would imagine that the best of the best of the world's developers
>> have been chasing the truly baffling side effects of this bug for
>> years, and probably just plain giving up.  There are all sorts of odd
>> mitigations in the code.  Lots of mystery crash reports.  I have to
>> wonder if it is all due to this.
>>
>> ---
>>
>> Credit for this bug find goes to the Foundational Technologies Team at
>> Disney; Cameron Stevens, my supervisor, who pushed me (and paid me) and
>> cleared my to-do list to make room to find this thing; of course moi,
>> who spent the last several weeks wracking my brains coming up with one
>> (sometimes absurd) hypothesis after another and one debugging and code
>> scanning technique after another trying to zero in on it...  and I
>> suppose Lisa, providing the incentive of stressing the need for
>> financial stability before we get married... a serious incentive to
>> make a name for myself.
>>
>> ---
>>
>> I created some ruby scripts to embed instrumentation in the source code
>> (and lots of hand cleanup to make it compile... sigh).
>>
>> Among other hacks:
>>
>>    each continuation is assigned a unique ID (jimw_stamp) when it is
>> created
>>
>>    each InkContInternal::init call creates another unique ID
>>        (because the freepool allocator does a memcpy without
>> constructor invocation)
>>
>>    each time an event was scheduled, the event inherited the jimw_stamp
>> from the continuation of "origin"
>>
>>    every schedule_* macro invocation, TSContSchedule, and a few others
>> kept track of the __FILE__ and __LINE__ that called them, logging that
>> in new member variables of the event object
>>
>>    the configure script was altered by replacing "-O2" with "-O0 -ggdb3"
>> for the release version
>>
>> When it crashes, the debugger shows me:
>>
>>     the Event::jimw_stamp taken from the continuation whose associated
>> instruction stream originated the event
>>
>>     the InkContInternal::jimw_stamp of the receiving continuation
>>
>>     the line and file number in the source code that originated the
>> event that triggered the crash
>>
>> The core dump showed an event generated for a continuation that had
>> been created by TSContCreate in (let's call it) path A through the
>> code.  The event handler that received the event was the code for
>> handling a completely different continuation, path B.
>>
>> Both path A and path B were based upon continuations that were created
>> with TSContCreate.  Usually, but not always, the handler function whose
>> code triggered the crash would be code for a continuation that was a
>> "helper" continuation intended to support the HttpSM class or other
>> similar classes.  This path B code was utterly unprepared to handle the
>> events from path A, would do something freaky, and quite naturally
>> trigger an assert.
>>
>> It took lots of convoluted setup to allow RHEL to generate core dumps
>> of setuid programs...
>>
>> The big boogeyman, HOWEVER, is that both the incoming EVENT and the
>> receiving mismatched CONTINUATION had the same "jimw_stamp" tag in them.
>>
>> WOAH!
>>
>> Think about that for a moment.
>>
>> An event that had been assigned the unique tag from a continuation
>> allocated by TSContCreate in path A is received by the code for another
>> continuation created by TSContCreate in path B.  They share the same
>> guaranteed-by-the-constructor-to-be-unique ID tag.
>>
>> The only possible explanation is that two continuations somehow had the
>> same jimw_stamp!  The same unique tag.  Physically impossible.  That
>> means that TWO LIVE CONTINUATIONS had to be occupying the SAME MEMORY
>> SLOT IN THE FREELIST.  One of the continuations is receiving all the
>> events for both of them!
>>
>> "YOU HAVE GOT TO BE KIDDING ME!"  I said out loud.  Banged my ankle on
>> something really hard.  Still nursing the wound.
>>
>> Um... that would be bad.  Real bad.  Unlikely, I thought.  That would
>> mean a bug in a freepool allocator...  most likely a missing mutex...
>> I could not imagine that kind of thing ever getting through a code
>> review gauntlet.  Hundreds of developers have inspected THAT code,
>> right?
>>
>> Code inspection shows that there are locks and mutexes and spinlocks,
>> atomic increments, atomic cas...  There is NO WAY that could fail,
>> right?  It looks like it should be bulletproof, but look at all those
>> preprocessor directives!
>>
>> Well, ink_freelist_new is not the easiest code to figure out.  Take a
>> look at it...  mind boggling...  After an hour or more I just plain
>> gave up trying to figure out what it is ACTUALLY DOING.  I went digging
>> through the gcc manuals and figured out how to turn on preprocessor
>> output.  If this code, so full of preprocessor directives, and so full
>> of locks, is putting two things into the same memory location: WHAT ON
>> EARTH IS ACTUALLY GETTING COMPILED?  I edited the Makefile and put -E
>> in place of the optimizer flags, and let the preprocessor do its thing.
>> What on earth does the actual C++ code that is actually getting
>> compiled look like?
>>
>> ---
>>
>> ink_queue.o, preprocessor output:
>>
>> void *
>> ink_freelist_new(InkFreeList * f)
>> {
>> # 234 "ink_queue.cc"
>>   head_p item;
>>   head_p next;
>>   int result = 0;
>>
>>   do {
>>     *((uint64_t*)&(item)) = *((uint64_t*)&(f->head));
>>     if (((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) |
>>         (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) == __null) {
>>       uint32_t type_size = f->type_size;
>>       uint32_t i;
>> # 255 "ink_queue.cc"
>>       void *newp = __null;
>>
>>       if (f->alignment)
>>         newp = ink_memalign(f->alignment, f->chunk_size * type_size);
>>       else
>>         newp = ink_malloc(f->chunk_size * type_size);
>>       if (newp)
>>         ink_atomic_increment64(&freelist_allocated_mem,
>>                                (int64_t) (f->chunk_size * type_size));
>>
>>       (item).data = ((((intptr_t)(newp))&0x0000FFFFFFFFFFFFULL) |
>>                      (((0)&0xFFFFULL) << 48));
>> # 291 "ink_queue.cc"
>>       ink_atomic_increment((int *) &f->allocated, f->chunk_size);
>>       ink_atomic_increment64(&fastalloc_mem_total,
>>                              (int64_t) f->chunk_size * f->type_size);
>>
>>       for (i = 0; i < f->chunk_size; i++) {
>>         char *a = ((char *) ((void*)(((((intptr_t)(item).data)<<16)>>16) |
>>             (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))) +
>>             i * type_size;
>>
>>         ink_freelist_free(f, a);
>> # 315 "ink_queue.cc"
>>       }
>>
>>       ink_atomic_increment((int *) &f->count, f->chunk_size);
>>       ink_atomic_increment64(&fastalloc_mem_in_use,
>>                              (int64_t) f->chunk_size * f->type_size);
>>
>>     } else {
>>       (next).data = ((((intptr_t)(*((void **)((char *)((void*)(((void*)(
>>           ((((intptr_t)(item).data)<<16)>>16) |
>>           (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) +
>>           f->offset))))&0x0000FFFFFFFFFFFFULL) |
>>           ((((((intptr_t)(item).data)>>48) + 1)&0xFFFFULL) << 48));
>>
>>       result = ink_atomic_cas64((int64_t *) & f->head.data, item.data,
>>                                 next.data);
>> # 345 "ink_queue.cc"
>>     }
>>   }
>>   while (result == 0);
>>   (void)(!((uintptr_t)((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) |
>>       (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48))))) &
>>       (((uintptr_t)f->alignment)-1)));
>>
>>   ink_atomic_increment((int *) &f->count, 1);
>>   ink_atomic_increment64(&fastalloc_mem_in_use, (int64_t) f->type_size);
>>
>>   return ((void*)(((void*)(((((intptr_t)(item).data)<<16)>>16) |
>>       (((~((((intptr_t)(item).data)<<16>>63)-1))>>48)<<48)))));
>> }
>>
>> Allocator.o section:
>>
>> template<class C> inline C * ClassAllocator<C>::alloc()
>> {
>>   void *ptr = ink_freelist_new(&this->fl);
>>
>>   if (sizeof(C) < 512) {
>>     for (unsigned int i = 0;
>>          i < (((sizeof(C))+15)&~15) / sizeof(int64_t); i++)
>>       ((int64_t *) ptr)[i] = ((int64_t *) &this->proto.typeObject)[i];
>>   } else
>>     memcpy(ptr, &this->proto.typeObject, sizeof(C));
>>   return (C *) ptr;
>> }
>>
>> InkAPI.cc:
>>
>> static ClassAllocator<INKContInternal>
>> INKContAllocator("INKContAllocator");
>>
>> Later in InkAPI.cc:
>>
>> ////////////////////////////////////////////////////////////////////
>> //
>> // Continuations
>> //
>> ////////////////////////////////////////////////////////////////////
>>
>> TSCont
>> TSContCreate(TSEventFunc funcp, TSMutex mutexp)
>> {
>>   // mutexp can be NULL
>>   if (mutexp != NULL)
>>     sdk_assert(sdk_sanity_check_mutex(mutexp) == TS_SUCCESS);
>>
>>   INKContInternal *i = INKContAllocator.alloc();
>>
>>   i->init(funcp, mutexp);
>> // printf ("CREATINGCONT: %p  marker: %d\n", i, i->jimw_stamp);
>>   return (TSCont)i;
>> }
>>
>> LOOK MOM!  NO LOCKS!!!!!
>>
>> ---------------------------------
>>
>> Now, my guess is that for who knows how many years, anyone reviewing
>> the code saw this in ink_queue.h and assumed the locks were in place.
>>
>> INK_USE_MUTEX_FOR_FREELISTS is not in actuality defined; it is only
>> defined when IS_TS_MICRO_BUILD (??!?) is 1, which it is NOT.
>>
>> A "grep" shows the lines in ink_queue.h where INK_USE_MUTEX_FOR_FREELISTS
>> is tested:
>>
>> 161 #if !defined(INK_USE_MUTEX_FOR_FREELISTS)
>> 162   inkcoreapi void *ink_freelist_new(InkFreeList * f);
>> 163   inkcoreapi void ink_freelist_free(InkFreeList * f, void *item);
>> 164 #else                           /* INK_USE_MUTEX_FOR_FREELISTS */
>> 165   inkcoreapi void *ink_freelist_new_wrap(InkFreeList * f);
>> 166   static inline void *ink_freelist_new(InkFreeList * f)
>> 167   {
>> 168     void *retval = NULL;
>> 169
>> 170     ink_mutex_acquire(&(f->inkfreelist_mutex));
>> 171     retval = ink_freelist_new_wrap(f);
>> 172     ink_mutex_release(&(f->inkfreelist_mutex));
>> 173     return retval;
>> 174   }
>> 175
>> 176   inkcoreapi void ink_freelist_free_wrap(InkFreeList * f, void *item);
>> 177   static inline void ink_freelist_free(InkFreeList * f, void *item)
>> 178   {
>> 179     ink_mutex_acquire(&(f->inkfreelist_mutex));
>> 180     ink_freelist_free_wrap(f, item);
>> 181     ink_mutex_release(&(f->inkfreelist_mutex));
>> 182   }
>> 183 #endif /* INK_USE_MUTEX_FOR_FREELISTS */
>>
>> If you turn it on by defining INK_USE_MUTEX_FOR_FREELISTS at the top of
>> ink_queue.h, the code does NOT compile.  This suggests a VERY old bug.
>> Once you get it to compile by adding the missing header file, it
>> deadlocks on itself at line 306 of ink_queue.cc, where it calls
>> ink_freelist_free instead of ink_freelist_free_wrap...  another sign of
>> a very old bug...  Well...  Fix that...  fire it up...
>>
>> Stable...  REAL STABLE.  It has been running for over twenty hours now,
>> hammered by my ruby script load generator, with no trace of errors,
>> where it used to barf and crash in seconds or minutes, or occasionally
>> an hour.
>>
>> But I can not get it to show over 56% of CPU with "top" (15-20% max on
>> any given core), not even with 100 load generators pounding on ATS.
>> Before, it didn't take much of anything to make it max out all four
>> virtual cores.  This is, of course, on my Mac VMWare Fusion with the
>> version from 2009...  the inexplicable delays could be some kind of
>> queuing glitch from what I suspect is the probable absence of a
>> monotonically increasing, possibly double-buffered clock on my VM/Linux
>> kernel.  So I tried it with the 3.0.3 version with Brian's queuing fix,
>> and it does the same thing, with query times starting at around 30 ms
>> instead of 3 ms...
>>
>> Turn on the locks, and I get a 5 to 10X performance hit!
>>
>> I am not sure yet what is up with the degraded performance from turning
>> on these mutex locks...  I would expect a hit because of locking
>> delays...  but this is REALLY severe...  This is why malloc on a
>> multi-core box is a challenge: thread locks are a real serious pain.
>> What has me slightly baffled is that f->inkfreelist_mutex appears to be
>> local to each freelist...  they are NOT global locks, and yet it slows
>> down to a snail's pace.
>>
>> Cameron says it's time to have some other brains look at this one.
>> That's probably a good judgment call, because I am pretty wired.  Hard
>> to think straight this Friday afternoon.
>>
>> ----------------------------------
>>
>> How to fix this?
>>
>> Well, tcmalloc has an interesting approach...
>>
>> Gets me thinking anyway...
>>
>> Replacing the freepools outright with tcmalloc doesn't seem like an
>> optimal solution.  We might wind up with a memory fragmentation
>> problem...  Also, the current code has been debugged and is running
>> using the freepool's memcpy of a prototype object in place of a
>> constructor.  Given how much code there is in ATS, I would not want to
>> force every class that uses the freepools to suddenly call a
>> constructor for every object...  There is no telling what that could
>> break if there are bugs in any of the previously unexercised
>> constructors.
>>
>> So I am thinking of having per-thread free pools that share a global
>> pool.  In that kind of design, if the per-thread free pool has anything
>> available, the memory allocator doesn't need to lock.  It only needs to
>> lock when grabbing a batch of, say, 100 items from the global shared
>> pool, or when returning excessive numbers of freed items back to the
>> global free pool.  Every thread has to be able to return items to its
>> own free pool no matter which thread allocated them, so I am initially
>> imagining that having a global pool with locking batch operators
>> feeding thread pools would be the best design... though there could be
>> better.
>>
>> One complicated option is having a dedicated thread, triggered by
>> watermarks detected by each thread that owns a pool, that would push or
>> pull batches of thread-pool objects out of the global pools.  That is
>> kind of complicated, but it would keep the thread pools full all the
>> time with never a locking delay.  Assuming I do it right, of course.
>>
>> Ideas?
>>
>> ---------------------------------
>>
>> A couple of years ago, I worked for 21 months at Microsoft's bing.com
>> UX Platform Team (since renamed the "Front Door Team", I have been
>> told), on the front-line server team that handled all the search
>> traffic for all of bing.com.  I am a grey-haired C++ server performance
>> and stability WIZARD (for lack of a better term) with industry
>> experience going back into the 1980's.  I am probably just what the ATS
>> team needs as a committer...
>>
>> The "-ND" in my name field means "Not a Disney employee".  I am a
>> contractor here, with career aspirations.
>>
>> I have heard from numerous sources that the ATS committer team members
>> are regarded as celebrities in the field, especially if they hold onto
>> that distinction for any length of time.  That someone who gets on the
>> ATS committer team just does not wind up on unemployment...  That
>> Disney management (who seem to like me... and I just love this place
>> and the people here...) would absolutely NEVER even think of letting me
>> go.
>>
>> Some say that the best way to go is to show up with a huge bang and
>> snow your doors in.
>>
>> One of the big motivators to take this contract position was something
>> Cameron said: "we need someone who knows C++ to learn ATS inside and
>> out, which might provide the opportunity for open source contributions
>> and making a name for yourself".
>>
>> So... Umm... HI THERE!
>>
>>
>
