uima-dev mailing list archives

From: Thilo Goetz <twgo...@gmx.de>
Subject: Re: Delta CAS
Date: Wed, 09 Jul 2008 05:51:49 GMT
Eddie Epstein wrote:
> On Tue, Jul 8, 2008 at 4:32 PM, Thilo Goetz <twgoetz@gmx.de> wrote:
> 
>> Precisely.  It's an implementation detail of the CAS heap that
>> we should be able to change -- that we must be able to change
>> if we would like to improve on the heap.  The CAS heap and
>> in particular the way it grows is a major performance bottleneck
>> for large documents.  If we have other parts of UIMA depend on
>> the (bad) implementation details now, we'll never be able to
>> improve on the design.
> 
> 
> Can you give details on how CAS heap growth is a major bottleneck for large
> documents? Do we have specific scenarios that can be used to measure current
> heap performance and potential fixes?
> 

Nothing so easy.  The CAS heap is one large int array.  We grow it
by allocating a new array with the new desired size and copying the
old values over to the new one (there's a small sketch of that
pattern after the list below).  There are several issues with this
method:

* Copying the data takes a surprisingly long time.  There's a test
case in core that does nothing but add new FSs to the CAS, a lot of
them.  Marshall complained about how long it took to run when I
added it (about 20s on my machine).  If you profile that test case,
you will see that the vast majority of time is spent in copying
data from an old heap to a new heap.  If the CAS becomes sufficiently
large (in the hundreds of MBs), the time it takes to actually add
FSs to the CAS is completely dwarfed by the time it takes for the
heap to grow.

* The heap lives in a single large array, and a new single large
array is allocated every time the heap grows.  This is a challenge
for the jvm as it allocates this array in a contiguous block of
memory.  So there must be enough contiguous space on the jvm heap,
which likely means a full heap compaction before a new large array
can be allocated.  Sometimes the jvm fails to allocate that
contiguous space, even though there are enough free bytes on the
jvm heap.

* I saved the best for last.  When we allocate a new array, the old
one hangs around until we have copied the data over.  So we're using
twice the necessary space for some period of time, and that space is
often not available.  Any time I see an out-of-memory error for large
documents (and it's not a bug in the annotator chain), it happens
when the CAS heap grows: not because there isn't enough room for
the larger heap, but because the old one is still there as well.
Because of that issue, the CAS can only grow to about half the size
we actually have memory for.
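
To make the first and third points concrete, here is a minimal
sketch of the grow-by-copy pattern.  This is not the actual CAS heap
code; the class name, the initial size and the growth factor below
are made up purely for illustration.

    // Minimal sketch of grow-by-copy, not the actual UIMA CAS heap code.
    // Names, initial size and growth factor are invented for illustration.
    public class IntHeapSketch {

        private int[] heap = new int[1024];
        private int pos = 1;  // next free cell; cell 0 stays unused

        /** Append one cell and return its address. */
        public int add(int value) {
            if (pos == heap.length) {
                grow();
            }
            heap[pos] = value;
            return pos++;
        }

        private void grow() {
            // A second large, contiguous array is allocated while the old
            // one is still referenced: peak memory is roughly twice the
            // heap size, and the copy takes time proportional to the
            // current size.
            int[] newHeap = new int[heap.length * 2];
            System.arraycopy(heap, 0, newHeap, 0, pos);
            heap = newHeap;  // old array becomes garbage only after the copy
        }
    }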

Now these things may not be issues for everybody.  You may not
process large documents.  Your analysis may be so slow that you
don't notice those things.  You may be serializing CASes to XML
and sending them over the network, which completely dwarfs the
performance issues mentioned above.

I live in a world where both memory and CPU cycles are at a
premium, *and* we process large documents.  So I do care.

Going forward, one approach to this problem could be not one
heap array, but a list of arrays.  Every time we grow the heap,
we would just add another array.  That approach solves all the
problems mentioned above while being minimally invasive to the
way the CAS currently works.  However, it raises a new issue:
how do you address cells across several arrays in an efficient
manner?  We don't want to improve performance for large docs at
the expense of small ones.  So heap addresses might stop being
the linear sequence of integers they are today.  Maybe we'll
use the high bits to address the array, and the low bits to
address cells in a given array.  And there goes the watermark.
Maybe this won't be necessary, I don't know at this point.
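
Just to make the idea concrete, here is a rough sketch of what such
a segmented heap with high/low bit addressing might look like.  The
segment size, the shift and all the names are made up; this is an
illustration, not a design proposal.

    // Rough sketch of a segmented heap with high/low bit addressing.
    // Segment size, shift and names are invented for illustration only.
    import java.util.ArrayList;
    import java.util.List;

    public class SegmentedHeapSketch {

        private static final int SEGMENT_BITS = 20;  // 1M cells/segment (assumed)
        private static final int SEGMENT_SIZE = 1 << SEGMENT_BITS;
        private static final int OFFSET_MASK  = SEGMENT_SIZE - 1;

        private final List<int[]> segments = new ArrayList<int[]>();
        private int pos = 1;  // next free cell; cell 0 stays unused

        public SegmentedHeapSketch() {
            segments.add(new int[SEGMENT_SIZE]);
        }

        /** Append one cell and return its address. */
        public int add(int value) {
            int seg = pos >>> SEGMENT_BITS;
            if (seg == segments.size()) {
                // Growing means appending a fresh segment: no copying, no
                // doubled peak memory, no need for one huge contiguous
                // block on the jvm heap.
                segments.add(new int[SEGMENT_SIZE]);
            }
            segments.get(seg)[pos & OFFSET_MASK] = value;
            return pos++;
        }

        /** High bits select the segment, low bits the cell within it. */
        public int get(int address) {
            return segments.get(address >>> SEGMENT_BITS)[address & OFFSET_MASK];
        }
    }

In a real version we would probably not let a single FS straddle a
segment boundary, so some addresses would get skipped.  That is
exactly why the addresses would stop being the dense linear sequence
they are today, and why a simple address watermark breaks.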

Now what?  Let me quote Adam from another message:

"I think what's happening here is that XmiCasSerializer has a
requirement on the CAS that it can efficiently know which FS are new
(created since some mark - when the CAS was received at a service in
this case).  Currently this is doable using the FS addresses, but if
that were no longer the case, we could build a different marking
mechanism that does the same thing.  So I think I'm agreeing with
Marshall there."

Excellent suggestion, except why not have this discussion now?
We just need to put our heads together and figure out how to address
this requirement to everybody's satisfaction, case closed.  I'm
not disagreeing with the requirement, just with the proposed
implementation of it.  Doing this now may save us (ok, me) a lot of
trouble later.
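
To be concrete about what I mean, the kind of thing Adam describes
could be as simple as an interface along these lines.  The names
below are made up for this discussion (this is not existing API);
the point is only that the contract can be stated without any
reference to heap addresses.

    // Hypothetical sketch for discussion only -- not existing UIMA API.
    import org.apache.uima.cas.FeatureStructure;

    public interface CasMark {

        /**
         * True if the given FS was created after this mark was taken
         * (e.g. after the CAS was received at a service), regardless of
         * how the heap stores feature structures internally.
         */
        boolean isNew(FeatureStructure fs);
    }

The CAS would hand out such a mark when asked (via some createMark()
method, again just a made-up name), and the XMI serializer would only
ever talk to the mark, never to raw heap addresses.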

--Thilo



