uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: [jira] Created: (UIMA-1068) Use of the JCas cache should be configurable
Date Thu, 12 Jun 2008 03:09:49 GMT
Thilo Goetz wrote:
> Marshall Schor wrote:
>> Thilo Goetz (JIRA) wrote:
>> Some applications may break if they require == between instances of 
>> the same JCas object.  Others of course won't care.  So - it's good 
>> for this to be configurable.
> Any annotator that works with this assumption is broken IMO.
> Why would anybody make such an assumption?  
One use case: With JCas it is possible to add fields to the cover class 
(thus, you could add a hashmap object, for instance); this is described 
in the documentation for JCas.  Those field values are preserved across 
iterations only if the same JCas instance is kept. 
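
To make the use case concrete, here is a minimal sketch in plain Java. It does not use the UIMA API: TokenCover, CoverView, and the "address" key are hypothetical stand-ins for a JCas cover class, the object that hands out cover instances, and the location of the underlying feature structure. It only illustrates that an extra Java field survives when the same cover instance is handed back, and is lost when a fresh instance is created on each access.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a JCas cover class; not the UIMA API.
class TokenCover {
    final int address;  // assumed key identifying the underlying feature structure
    final Map<String, String> notes = new HashMap<>();  // extra Java-only field added by the user

    TokenCover(int address) { this.address = address; }

    // Two covers are equal when they refer to the same underlying feature structure.
    @Override public boolean equals(Object o) {
        return o instanceof TokenCover && ((TokenCover) o).address == address;
    }

    @Override public int hashCode() { return Integer.hashCode(address); }
}

// Hypothetical object that hands out cover instances, with or without a cache.
class CoverView {
    private final boolean useCache;
    private final Map<Integer, TokenCover> cache = new HashMap<>();

    CoverView(boolean useCache) { this.useCache = useCache; }

    TokenCover get(int address) {
        if (!useCache) {
            return new TokenCover(address);  // fresh instance on every access
        }
        return cache.computeIfAbsent(address, TokenCover::new);  // same instance every time
    }
}

class IdentityDemo {
    public static void main(String[] args) {
        CoverView cached = new CoverView(true);
        cached.get(7).notes.put("lemma", "whale");
        System.out.println(cached.get(7).notes.get("lemma"));    // prints "whale"

        CoverView uncached = new CoverView(false);
        uncached.get(7).notes.put("lemma", "whale");
        System.out.println(uncached.get(7).notes.get("lemma"));  // prints "null"

        // equals() still holds without the cache; only == and the extra field are affected.
        System.out.println(uncached.get(7).equals(uncached.get(7)));  // prints "true"
    }
}
```

In this sketch, code that compares covers with equals() behaves the same in both modes; only code that relies on == or on Java-only fields notices whether the cache is on.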

> I don't see anything
> in our documentation that encourages this.  To the contrary,
> we say that we don't guarantee object identity for feature
> structures, and that equals() should be used to compare them.
>> It might be good, also, to put in "soft references" for this - which 
>> will be reclaimed if memory gets low.  But this might end up doubling 
>> the size of the storage used for this (to hold the soft reference)...
>> -Marshall
>>> Use of the JCas cache should be configurable
>>> --------------------------------------------
>>>                  Key: UIMA-1068
>>>                  URL: https://issues.apache.org/jira/browse/UIMA-1068
>>>              Project: UIMA
>>>           Issue Type: Improvement
>>>           Components: Core Java Framework
>>>     Affects Versions: 2.2.2
>>>             Reporter: Thilo Goetz
>>>             Assignee: Thilo Goetz
>>>              Fix For: 2.3
>>> The JCas caches all CAS objects that are accessed through it.  This 
>>> means that JCas objects that are no longer used can't be garbage 
>>> collected.  If only part of the processing chain uses the JCas, or 
>>> the caching is redundant for some other reason, this produces a 
>>> severe memory overhead.
>>> I ran the same experiment I ran for UIMA-1067: doubled the size of 
>>> Moby Dick and ran the POS tagger from the sandbox.  I used the 
>>> improved version from UIMA-1067 as base case and simply commented 
>>> out the line that adds JCas objects to the cache.  This reduced the 
>>> required heap size from 115MB to 105MB.  It also improved the 
>>> performance from around 10s for the base case to consistently under 
>>> 9s for the version without any caching.  I looked at the tagger 
>>> source code, and saw that it keeps its own list of tokens around.  
>>> So the savings are just the caching data structure.
>>> There may be cases where the JCas cache is a performance win, though 
>>> I'd be curious to see the benchmarks.  So we should not just turn it 
>>> off, but make it configurable.
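
As a rough illustration of the two ideas in this thread (a configurable cache, and soft references that the garbage collector may reclaim when memory gets low), here is a sketch in plain Java. The CacheMode enum, the ConfigurableCoverCache class, and the integer "address" key are assumptions made for the example; they are not UIMA classes or configuration parameters. Only java.lang.ref.SoftReference is a real JDK class.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical cache modes; the names are illustrative only.
enum CacheMode { NONE, STRONG, SOFT }

// Hypothetical cache of cover objects keyed by an assumed integer address.
class ConfigurableCoverCache<T> {
    private final CacheMode mode;
    private final IntFunction<T> factory;  // creates a cover object for an address
    private final Map<Integer, T> strong = new HashMap<>();
    private final Map<Integer, SoftReference<T>> soft = new HashMap<>();

    ConfigurableCoverCache(CacheMode mode, IntFunction<T> factory) {
        this.mode = mode;
        this.factory = factory;
    }

    T get(int address) {
        switch (mode) {
            case STRONG:
                // Entries stay reachable for as long as the cache itself is reachable.
                return strong.computeIfAbsent(address, a -> factory.apply(a));
            case SOFT: {
                // The referent may be cleared by the garbage collector under memory pressure.
                SoftReference<T> ref = soft.get(address);
                T cover = (ref == null) ? null : ref.get();
                if (cover == null) {
                    cover = factory.apply(address);
                    soft.put(address, new SoftReference<>(cover));
                }
                return cover;
            }
            default:
                // NONE: no bookkeeping at all, a fresh object on every access.
                return factory.apply(address);
        }
    }
}

class CacheModeDemo {
    public static void main(String[] args) {
        ConfigurableCoverCache<Object> none =
            new ConfigurableCoverCache<>(CacheMode.NONE, a -> new Object());
        ConfigurableCoverCache<Object> strong =
            new ConfigurableCoverCache<>(CacheMode.STRONG, a -> new Object());

        System.out.println(none.get(1) == none.get(1));      // prints "false"
        System.out.println(strong.get(1) == strong.get(1));  // prints "true"
    }
}
```

The SOFT branch also shows the storage concern raised above: every cached entry carries an extra SoftReference object in addition to the map entry, so the bookkeeping per cover object grows even though the covers themselves become reclaimable.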