uima-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Thilo Goetz <twgo...@gmx.de>
Subject Re: Delta CAS
Date Wed, 09 Jul 2008 13:18:12 GMT

Eddie Epstein wrote:
> On Wed, Jul 9, 2008 at 1:51 AM, Thilo Goetz <twgoetz@gmx.de> wrote:
>> Nothing so easy.  The CAS heap is one large int array.  We grow it
>> by allocating a new array with the new desired size and copying the
>> old values over to the new one.  There are several issues with this
>> method:
>> * Copying the data takes a surprisingly long time.  There's a test
>> case in core that does nothing but add new FSs to the CAS, a lot of
>> them.  Marshall complained about how long it took to run when I
>> added it (about 20s on my machine).  If you profile that test case,
>> you will see that the vast majority of time is spent in copying
>> data from an old heap to a new heap.  If the CAS becomes sufficiently
>> large (in the hundreds of MBs), the time it takes to actually add
>> FSs to the CAS is completely dwarfed by the time it takes for the
>> heap to grow.
>> * The heap lives in a single large array, and a new single large
>> array is allocated every time the heap grows.  This is a challenge
>> for the jvm as it allocates this array in a contiguous block of
>> memory.  So there must be enough contiguous space on the jvm heap,
>> which likely means a full heap compaction before a new large array
>> can be allocated.  Sometimes the jvm fails to allocate that
>> contiguous space, even though there are enough free bytes on the
>> jvm jeap.
>> * Saved the best for last.  When allocating a new array, the old
>> one hangs around till we have copied the data.  So we're using twice
>> the necessary space for some period of time.  That space is often
>> not available.  So any time I see an out-of-memory error for large
>> documents (and it's not a bug in the annotator chain), it happens
>> when the CAS heap grows; not because there isn't enough room for
>> the larger heap, but because the old one is still there as well.
>> The CAS can only grow to about half the size we have memory for
>> because of that issue.
> The situation is more complicated than portrayed. The heap does not have to
> shrink, so the growth penalty is rare and can be eliminated entirely if the
> max necessary size heap is specified at startup. FS allocated in the heap do

You don't want to allocate a max heap size of 500M just because
you may need one that big.  You don't even want to allocate 10M
ahead of time because if you have many small documents, you can
do more parallel processing.  So no, I can't specify a large enough
heap at start-up and yes, the heap most certainly has to shrink
on CAS reset.

> not have any Java object memory overhead. Garbage collection for separate FS
> objects would be [much?] worse than the time it takes currently to clear the
> used part of a CAS heap.

I won't believe this until I see it, but I wasn't suggesting
this so I'm not going to argue the point, either.

> Going forward, one approach to this problem could be not one
>> heap array, but a list of arrays.  Every time we grow the heap,
>> we would just add another array.  That approach solves all the
>> problems mentioned above while being minimally invasive to the
>> way the CAS currently works.  However, it raises a new issue:
>> how do you address cells across several arrays in an efficient
>> manner?  We don't want to improve performance for large docs at
>> the expense of small ones.  So heap addresses might stop being
>> the linear sequence of integers they are today.  Maybe we'll
>> use the high bits to address the array, and the low bits to
>> address cells in a given array.  And there goes the watermark.
>> Maybe this won't be necessary, I don't know at this point.
> Each FS object could include an ID that would allow maintaining a high water
> mark, of course at the expense of another 4 bytes per. With a heap
> constructed from multiple discontiguous arrays, each array could include a
> relative ID. This is not to say that the high water mark is always the right
> approach :)

I'm trying to decrease the memory overhead, not increase it.

> Excellent suggestion, except why not have this discussion now?
>> We just need to put our heads together and figure out how to address
>> this requirement to everybody's satisfaction, case closed.  I'm
>> not disagreeing with the requirement, just the proposed implementation
>> thereof.  Doing this now may save us (ok, me) a lot of trouble later.
> Who is against having the discussion now :)

Marshall seemed to favor a discussion at a later point.  Maybe
I misinterpreted.

> Eddie

View raw message