uima-dev mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: [jira] Created: (UIMA-1068) Use of the JCas cache should be configurable
Date Mon, 09 Jun 2008 14:26:45 GMT
Marshall Schor wrote:
> Thilo Goetz (JIRA) wrote:
> Some applications may break if they require == between instances of the 
> same JCas object.  Others of course won't care.  So - it's good for this 
> to be configurable.

Any annotator that works with this assumption is broken IMO.
Why would anybody make such an assumption?  I don't see anything
in our documentation that encourages this.  To the contrary,
we say that we don't guarantee object identity for feature
structures, and that equals() should be used to compare them.
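
To make the == versus equals() distinction concrete, here is a minimal
sketch. FsWrapper is a hypothetical stand-in for a JCas cover object (a
thin Java wrapper around an address in the underlying CAS); it is not a
real UIMA class, and the real implementation may differ. Without a cache,
two wrappers for the same feature structure are distinct Java objects, so
only equals() gives the answer an annotator wants.

```java
public class IdentityVsEquals {
    // Hypothetical stand-in for a JCas cover object. Not part of UIMA.
    static final class FsWrapper {
        private final int casAddress;

        FsWrapper(int casAddress) {
            this.casAddress = casAddress;
        }

        // Two wrappers are equal when they point at the same CAS address.
        @Override
        public boolean equals(Object other) {
            return other instanceof FsWrapper
                && ((FsWrapper) other).casAddress == casAddress;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(casAddress);
        }
    }

    public static void main(String[] args) {
        // Two separately created wrappers for the same (made-up) address.
        FsWrapper a = new FsWrapper(42);
        FsWrapper b = new FsWrapper(42);

        System.out.println(a == b);      // identity comparison
        System.out.println(a.equals(b)); // value comparison
    }
}
```

An annotator written against equals() behaves the same whether or not a
cache hands back the identical instance.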

> It might be good, also, to put in "soft references" for this - which 
> will be reclaimed if memory gets low.  But this might end up doubling 
> the size of the storage used for this (to hold the soft reference)...
> -Marshall
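
For reference, the soft-reference idea could look roughly like the sketch
below. SoftCache and its method names are made up for illustration and are
not the JCas implementation; only java.lang.ref.SoftReference is a real
JDK class. The sketch also shows where the extra storage comes from: each
entry carries a SoftReference object in addition to the cached value, and
the map entry stays behind even after the value has been reclaimed.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical cache keyed by CAS address. Values are held only softly,
// so the garbage collector may reclaim them when memory gets low.
public class SoftCache<T> {
    private final Map<Integer, SoftReference<T>> entries = new HashMap<>();

    // Returns the cached value for the address, or builds a new one with
    // the factory if there is no entry or the old value was reclaimed.
    public T get(int address, IntFunction<T> factory) {
        SoftReference<T> ref = entries.get(address);
        T value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = factory.apply(address);
            entries.put(address, new SoftReference<>(value));
        }
        return value;
    }

    // Number of map entries, including ones whose value was reclaimed.
    public int size() {
        return entries.size();
    }
}
```

As long as the caller holds a strong reference to a value, repeated
lookups for the same address return that same instance.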
>> Use of the JCas cache should be configurable
>> --------------------------------------------
>>                  Key: UIMA-1068
>>                  URL: https://issues.apache.org/jira/browse/UIMA-1068
>>              Project: UIMA
>>           Issue Type: Improvement
>>           Components: Core Java Framework
>>     Affects Versions: 2.2.2
>>             Reporter: Thilo Goetz
>>             Assignee: Thilo Goetz
>>              Fix For: 2.3
>> The JCas caches all CAS objects that are accessed through it.  This 
>> means that JCas objects that are no longer used can't be garbage 
>> collected.  If only part of the processing chain uses the JCas, or the 
>> caching is redundant for some other reason, this produces a severe 
>> memory overhead.
>> I ran the same experiment I ran for UIMA-1067: doubled the size of 
>> Moby Dick and ran the POS tagger from the sandbox.  I used the 
>> improved version from UIMA-1067 as base case and simply commented out 
>> the line that adds JCas objects to the cache.  This reduced the 
>> required heap size from 115MB to 105MB.  It also improved the 
>> performance from around 10s for the base case to consistently under 9s 
>> for the version without any caching.  I looked at the tagger source 
>> code, and saw that it keeps its own list of tokens around.  So the 
>> savings are just the caching data structure.
>> There may be cases where the JCas cache is a performance win, though 
>> I'd be curious to see the benchmarks.  So we should not just turn it 
>> off, but make it configurable.
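
One way a configurable cache could be structured is sketched below. This
is not the actual change made for UIMA-1068; the class name, the
constructor flag and the Wrapper type are all assumptions made for the
example. With caching off, every lookup creates a fresh wrapper that can
be garbage collected as soon as the caller drops it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper factory whose cache can be switched on or off.
public class ConfigurableWrapperCache {
    // Hypothetical stand-in for a JCas cover object. Not part of UIMA.
    static final class Wrapper {
        final int casAddress;

        Wrapper(int casAddress) {
            this.casAddress = casAddress;
        }
    }

    private final boolean cachingEnabled;
    private final Map<Integer, Wrapper> cache = new HashMap<>();

    ConfigurableWrapperCache(boolean cachingEnabled) {
        this.cachingEnabled = cachingEnabled;
    }

    // With caching on, the same instance is returned for the same address
    // and stays reachable from the cache. With caching off, nothing is
    // retained here.
    Wrapper wrapperFor(int casAddress) {
        if (!cachingEnabled) {
            return new Wrapper(casAddress);
        }
        return cache.computeIfAbsent(casAddress, addr -> new Wrapper(addr));
    }

    // Number of wrappers currently held by the cache.
    int cachedCount() {
        return cache.size();
    }
}
```

Code that compares wrappers with == only works in the cached mode, which
is the compatibility concern raised earlier in the thread.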