uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Delta CAS
Date Wed, 09 Jul 2008 14:25:33 GMT
Thilo Goetz wrote:
> Eddie Epstein wrote:
>> On Wed, Jul 9, 2008 at 1:51 AM, Thilo Goetz <twgoetz@gmx.de> wrote:
>>> Nothing so easy.  The CAS heap is one large int array.  We grow it
>>> by allocating a new array with the new desired size and copying the
>>> old values over to the new one.  There are several issues with this
>>> method:
>>> * Copying the data takes a surprisingly long time.  There's a test
>>> case in core that does nothing but add new FSs to the CAS, a lot of
>>> them.  Marshall complained about how long it took to run when I
>>> added it (about 20s on my machine).  If you profile that test case,
>>> you will see that the vast majority of time is spent in copying
>>> data from an old heap to a new heap.  If the CAS becomes sufficiently
>>> large (in the hundreds of MBs), the time it takes to actually add
>>> FSs to the CAS is completely dwarfed by the time it takes for the
>>> heap to grow.
>>> * The heap lives in a single large array, and a new single large
>>> array is allocated every time the heap grows.  This is a challenge
>>> for the jvm as it allocates this array in a contiguous block of
>>> memory.  So there must be enough contiguous space on the jvm heap,
>>> which likely means a full heap compaction before a new large array
>>> can be allocated.  Sometimes the jvm fails to allocate that
>>> contiguous space, even though there are enough free bytes on the
>>> jvm heap.
>>> * Saved the best for last.  When allocating a new array, the old
>>> one hangs around till we have copied the data.  So we're using twice
>>> the necessary space for some period of time.  That space is often
>>> not available.  So any time I see an out-of-memory error for large
>>> documents (and it's not a bug in the annotator chain), it happens
>>> when the CAS heap grows; not because there isn't enough room for
>>> the larger heap, but because the old one is still there as well.
>>> The CAS can only grow to about half the size we have memory for
>>> because of that issue.
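
As an illustration of the grow-by-copy behaviour described in the list above, here is a minimal sketch. It is not the actual CAS heap code: the class name SingleArrayHeap, the doubling policy, and the cellsCopied counter are assumptions made only for this example.

```java
/** Hypothetical single-array heap; a simplified stand-in, not the UIMA CAS class. */
final class SingleArrayHeap {
    private int[] cells;
    private int used;        // index of the next free cell
    long cellsCopied;        // total cells copied by all growth steps so far

    SingleArrayHeap(int initialCapacity) {
        cells = new int[initialCapacity];
    }

    /** Reserves n cells and returns the address of the first one. */
    int allocate(int n) {
        if (used + n > cells.length) {
            grow(used + n);
        }
        int addr = used;
        used += n;
        return addr;
    }

    private void grow(int minCapacity) {
        // Assumed policy: at least double the capacity on every growth step.
        int newCapacity = Math.max(minCapacity, cells.length * 2);
        // Both arrays are reachable until the copy finishes, so the peak
        // footprint at this point is old capacity + new capacity.
        int[] bigger = new int[newCapacity];
        System.arraycopy(cells, 0, bigger, 0, used);
        cellsCopied += used;
        cells = bigger;
    }

    int capacity() { return cells.length; }
    int get(int addr) { return cells[addr]; }
    void set(int addr, int value) { cells[addr] = value; }

    public static void main(String[] args) {
        SingleArrayHeap heap = new SingleArrayHeap(4);
        for (int i = 0; i < 3; i++) {
            heap.allocate(3);
        }
        System.out.println("capacity=" + heap.capacity()
                + " cellsCopied=" + heap.cellsCopied);
    }
}
```

Every growth step in this sketch needs one contiguous block for the new array while the old one is still alive, which is where the copy time and the doubled footprint come from.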
>> The situation is more complicated than portrayed. The heap does not have to
>> shrink, so the growth penalty is rare and can be eliminated entirely if the
>> max necessary size heap is specified at startup. FS allocated in the heap do
> You don't want to allocate a max heap size of 500M just because
> you may need one that big.  You don't even want to allocate 10M
> ahead of time because if you have many small documents, you can
> do more parallel processing.  So no, I can't specify a large enough
> heap at start-up and yes, the heap most certainly has to shrink
> on CAS reset.
Some intermediate approach might help here - such as an application or 
annotator being able to provide performance tuning hints to the 
framework.  For instance, a tokenizer might be able to guesstimate the 
number of tokens, based on some average token size estimate divided into 
the size of the document, and provide that as a hint.
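
A rough sketch of what such a hint calculation might look like follows. Nothing here is an existing UIMA API: the class HeapSizeHint, both method names, and the cellsPerToken parameter are hypothetical, and the numbers in main are made up for illustration.

```java
/** Hypothetical helper for a sizing hint; not part of any existing framework API. */
final class HeapSizeHint {

    /** Guesses the token count as document length divided by average token length. */
    static int estimateTokenCount(int documentLength, double averageTokenLength) {
        if (documentLength < 0 || averageTokenLength <= 0.0) {
            throw new IllegalArgumentException("need length >= 0 and average > 0");
        }
        return (int) Math.ceil(documentLength / averageTokenLength);
    }

    /**
     * Converts a token count into a number of heap cells.
     * cellsPerToken is an assumed per-annotation cost supplied by the caller.
     */
    static int estimateHeapCells(int tokenCount, int cellsPerToken) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(tokenCount, cellsPerToken);
    }

    public static void main(String[] args) {
        int tokens = estimateTokenCount(6000, 6.0);
        System.out.println("tokens=" + tokens
                + " cells=" + estimateHeapCells(tokens, 4));
    }
}
```

The framework would remain free to ignore or cap such a hint, for example when many small CASes are processed in parallel.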
>> not have any Java object memory overhead. Garbage collection for separate FS
>> objects would be [much?] worse than the time it takes currently to clear the
>> used part of a CAS heap.
> I won't believe this until I see it, but I wasn't suggesting
> this so I'm not going to argue the point, either.
>>> Going forward, one approach to this problem could be not one
>>> heap array, but a list of arrays.  Every time we grow the heap,
>>> we would just add another array.  That approach solves all the
>>> problems mentioned above while being minimally invasive to the
>>> way the CAS currently works.  However, it raises a new issue:
>>> how do you address cells across several arrays in an efficient
>>> manner?  We don't want to improve performance for large docs at
>>> the expense of small ones.  So heap addresses might stop being
>>> the linear sequence of integers they are today.  Maybe we'll
>>> use the high bits to address the array, and the low bits to
>>> address cells in a given array.  And there goes the watermark.
>>> Maybe this won't be necessary, I don't know at this point.
>> Each FS object could include an ID that would allow maintaining a high water
>> mark, of course at the expense of another 4 bytes per. With a heap
>> constructed from multiple discontiguous arrays, each array could include a
>> relative ID. This is not to say that the high water mark is always the right
>> approach :)
> I'm trying to decrease the memory overhead, not increase it.
Would there be a solution that would work for the multi-block heap, 
without adding 4 bytes per FS Object?
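
To make the question concrete, here is a minimal sketch of the high-bits/low-bits addressing Thilo describes, under two assumptions of mine: every segment has the same power-of-two size, and segments are only ever appended. SegmentedHeap and all of its members are hypothetical names, not existing CAS code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical multi-array heap; a sketch of the idea, not an actual CAS design. */
final class SegmentedHeap {
    private final int segmentBits;   // low bits of an address select the cell
    private final int segmentSize;   // 1 << segmentBits
    private final int offsetMask;    // segmentSize - 1
    private final List<int[]> segments = new ArrayList<>();
    private int used;                // next free address

    SegmentedHeap(int segmentBits) {
        if (segmentBits < 1 || segmentBits > 30) {
            throw new IllegalArgumentException("segmentBits must be in 1..30");
        }
        this.segmentBits = segmentBits;
        this.segmentSize = 1 << segmentBits;
        this.offsetMask = segmentSize - 1;
        segments.add(new int[segmentSize]);
    }

    /** Reserves n cells that all lie in one segment; returns the first address. */
    int allocate(int n) {
        if (n < 1 || n > segmentSize) {
            throw new IllegalArgumentException("n must be in 1..segmentSize");
        }
        int offset = used & offsetMask;
        if (offset + n > segmentSize) {
            // Skip the tail of the current segment so the FS is not split.
            used = used - offset + segmentSize;
        }
        // Growing appends a fresh array; existing segments are never copied.
        while (((used + n - 1) >>> segmentBits) >= segments.size()) {
            segments.add(new int[segmentSize]);
        }
        int addr = used;
        used += n;
        return addr;
    }

    int get(int addr) {
        return segments.get(addr >>> segmentBits)[addr & offsetMask];
    }

    void set(int addr, int value) {
        segments.get(addr >>> segmentBits)[addr & offsetMask] = value;
    }

    int segmentCount() { return segments.size(); }

    public static void main(String[] args) {
        SegmentedHeap heap = new SegmentedHeap(3);   // 8 cells per segment
        int first = heap.allocate(5);
        int second = heap.allocate(5);               // does not fit; moves on
        System.out.println(first + " " + second + " " + heap.segmentCount());
    }
}
```

In this particular sketch, addresses still increase in allocation order (with gaps at segment ends), so comparing an address against a saved mark would still separate older cells from newer ones without any per-FS ID. Whether that property survives in a real design, and what the extra shift and mask cost for small documents, is exactly what would need measuring.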
>>> Excellent suggestion, except why not have this discussion now?
>>> We just need to put our heads together and figure out how to address
>>> this requirement to everybody's satisfaction, case closed.  I'm
>>> not disagreeing with the requirement, just the proposed implementation
>>> thereof.  Doing this now may save us (ok, me) a lot of trouble later.
>> Who is against having the discussion now :)
> Marshall seemed to favor a discussion at a later point.  Maybe
> I misinterpreted.
I did not intend to express favoring a discussion at a later point 
versus now.  Discussions at any point are good, IMHO.
>> Eddie
