uima-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eddie Epstein" <eaepst...@gmail.com>
Subject Re: Delta CAS
Date Wed, 09 Jul 2008 12:56:03 GMT
On Wed, Jul 9, 2008 at 1:51 AM, Thilo Goetz <twgoetz@gmx.de> wrote:

> Nothing so easy.  The CAS heap is one large int array.  We grow it
> by allocating a new array with the new desired size and copying the
> old values over to the new one.  There are several issues with this
> method:
> * Copying the data takes a surprisingly long time.  There's a test
> case in core that does nothing but add new FSs to the CAS, a lot of
> them.  Marshall complained about how long it took to run when I
> added it (about 20s on my machine).  If you profile that test case,
> you will see that the vast majority of time is spent in copying
> data from an old heap to a new heap.  If the CAS becomes sufficiently
> large (in the hundreds of MBs), the time it takes to actually add
> FSs to the CAS is completely dwarfed by the time it takes for the
> heap to grow.
> * The heap lives in a single large array, and a new single large
> array is allocated every time the heap grows.  This is a challenge
> for the jvm as it allocates this array in a contiguous block of
> memory.  So there must be enough contiguous space on the jvm heap,
> which likely means a full heap compaction before a new large array
> can be allocated.  Sometimes the jvm fails to allocate that
> contiguous space, even though there are enough free bytes on the
> jvm jeap.
> * Saved the best for last.  When allocating a new array, the old
> one hangs around till we have copied the data.  So we're using twice
> the necessary space for some period of time.  That space is often
> not available.  So any time I see an out-of-memory error for large
> documents (and it's not a bug in the annotator chain), it happens
> when the CAS heap grows; not because there isn't enough room for
> the larger heap, but because the old one is still there as well.
> The CAS can only grow to about half the size we have memory for
> because of that issue.

The situation is more complicated than portrayed. The heap does not have to
shrink, so the growth penalty is rare and can be eliminated entirely if the
max necessary size heap is specified at startup. FS allocated in the heap do
not have any Java object memory overhead. Garbage collection for separate FS
objects would be [much?] worse than the time it takes currently to clear the
used part of a CAS heap.

Going forward, one approach to this problem could be not one
> heap array, but a list of arrays.  Every time we grow the heap,
> we would just add another array.  That approach solves all the
> problems mentioned above while being minimally invasive to the
> way the CAS currently works.  However, it raises a new issue:
> how do you address cells across several arrays in an efficient
> manner?  We don't want to improve performance for large docs at
> the expense of small ones.  So heap addresses might stop being
> the linear sequence of integers they are today.  Maybe we'll
> use the high bits to address the array, and the low bits to
> address cells in a given array.  And there goes the watermark.
> Maybe this won't be necessary, I don't know at this point.

Each FS object could include an ID that would allow maintaining a high water
mark, of course at the expense of another 4 bytes per. With a heap
constructed from multiple discontiguous arrays, each array could include a
relative ID. This is not to say that the high water mark is always the right
approach :)

Excellent suggestion, except why not have this discussion now?
> We just need to put our heads together and figure out how to address
> this requirement to everybody's satisfaction, case closed.  I'm
> not disagreeing with the requirement, just the proposed implementation
> thereof.  Doing this now may save us (ok, me) a lot of trouble later.

Who is against having the discussion now :)


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message