uima-dev mailing list archives

From Nick Hill <apa...@nickhill.org>
Subject Re: Alternate CAS implementation
Date Thu, 02 Apr 2015 19:55:30 GMT
Thanks Richard, more replies below...

Quoting Richard Eckart de Castilho <rec@apache.org>:

> Hi Nick,
> On 02.04.2015, at 01:37, Nick Hill <apache@nickhill.org> wrote:
>>> From my point of view, it would be nice if it was possible to  
>>> configure the UIMA framework to produce either this new kind of  
>>> CAS or the old one without having to exchange a JAR - doing so  
>>> statically at initialization time or even dynamically at runtime.  
>>> E.g. to allow easily running test cases against both  
>>> implementations.
>> When you say "produce", there shouldn't be any visible difference  
>> in anything output or persisted, the impl is just how the CAS is  
>> stored internally in memory while processing is happening.
>> It won't be possible to switch the impl being used at runtime.  
>> There are classes for example with the same names but different  
>> impls (e.g. CASImpl). I know this isn't ideal for tests/comparisons  
>> between the two impls but quite a lot of things are currently  
>> tightly-coupled to the heap internals and so switching a jar  
>> doesn't seem too big a price to pay given no other code changes are  
>> needed.
> What do you plan to be the ultimate goal of this experiment? Is it  
> to support different CAS implementations or is it to replace the  
> existing CAS implementation with a totally different one?
> Most things in UIMA are created through factories (not the CAS so  
> far). So theoretically, one could replace most classes by custom  
> classes by reconfiguring the framework to use different factory  
> classes or having the factories produce different implementations.  
> Can you imagine that as well for the CAS?

For users the implementation shouldn't matter. They shouldn't observe  
any functional difference and therefore shouldn't really care if the  
impl changes underneath. All consuming code should work as-is, with  
the exception of code which accesses 'internals' directly - but I'd  
see this as analogous to accessing private fields in some Java SDK
class, which breaks when those fields change in a newer SDK version.

As such, I don't think it would make sense (or be very practical from a
maintenance point of view) to support two implementations concurrently
or to have a factory.

> Does it mean that the UIMA-C++ implementation is going to be  
> discontinued officially?

No. To clarify, no agreements or plans have been made; I just
wanted to initiate a discussion around this as a possible idea.
If we were to pursue this alternate implementation, I don't know of  
any reason why the C++ impl would be discontinued. I had just listed  
C++ AEs as one of the things which don't yet work with my current  

>>> Having to recompile the JCas classes is a bit of a blocker to me -  
>>> but I remember that Marshall was contemplating about a way to  
>>> generate JCas classes at runtime, so this might just be a  
>>> temporary blocker.
>> When I say recompile, I don't mean regenerate using JCasGen, just  
>> recompile .class files from the existing jcas .java files. I would  
>> expect that you would typically only be using one version (other  
>> than for comparison purposes - to validate functional equivalence  
>> and/or compare performance), and so this isn't something that would  
>> need to be done often.
> Compiled JCas classes tend to be shipped as part of frameworks. This  
> means that it will not be possible to switch to a new CAS impl just  
> by replacing a JAR. It will also mean that components from different  
> UIMA-based frameworks cannot be mixed and matched anymore unless  
> some broker like UIMA-AS is used.

The current JCas cover class format is quite old and tightly-coupled  
to the heap-based CAS internals. Saying that all new versions of UIMA  
must be binary-compatible with these therefore imposes a (somewhat  
crippling) restriction on possible internal improvements. You might  
say that the current JCas classes break standard  
abstraction/encapsulation principles if the expectation is they will  
be forever forwards binary-compatible.

It would not be hard on the UIMA side to move to a simpler and more  
abstract JCas cover class format that should avoid this problem in  
future, but the actual move to such a format would be even more  
disruptive than requiring a recompilation (would require a  
re-JCasGen), and would have the same issues you mention above.

I managed to make this object-based impl at least source-compatible  
with existing jcas cover classes, by 'converting' the impl of methods  
called that were intended to make CAS heap changes to actually be  
manipulating the FS objects directly.
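
To illustrate the general idea of redirecting heap-style calls to
objects, here is a minimal sketch. It is not the code from this
implementation: SketchFs, SketchLowLevelCas and register() are
hypothetical names, and the ll_setIntValue/ll_getIntValue methods only
mimic the style of the low-level accessors that generated cover classes
call. The point is that the caller still passes an int reference and a
feature code, while the value ends up in a plain Java object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a feature structure held as a plain Java object.
class SketchFs {
    private final Map<Integer, Integer> intFeatures = new HashMap<>();

    void setInt(int featCode, int value) {
        intFeatures.put(featCode, value);
    }

    int getInt(int featCode) {
        return intFeatures.getOrDefault(featCode, 0);
    }
}

// Hypothetical stand-in for the low-level entry points used by cover classes.
class SketchLowLevelCas {
    private final Map<Integer, SketchFs> byRef = new HashMap<>();
    private int nextRef = 1;

    // Hands out an int reference that plays the role of a heap address.
    int register(SketchFs fs) {
        int ref = nextRef++;
        byRef.put(ref, fs);
        return ref;
    }

    // Same call shape as a heap write, but it delegates to the object.
    void ll_setIntValue(int fsRef, int featCode, int value) {
        resolve(fsRef).setInt(featCode, value);
    }

    int ll_getIntValue(int fsRef, int featCode) {
        return resolve(fsRef).getInt(featCode);
    }

    private SketchFs resolve(int fsRef) {
        SketchFs fs = byRef.get(fsRef);
        if (fs == null) {
            throw new IllegalArgumentException("Unknown FS reference: " + fsRef);
        }
        return fs;
    }
}
```

Under this assumption, source written against the int-based calls keeps
compiling unchanged, which is all that source compatibility requires.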

>>> In one context, we also rely heavily on CAS addresses serving as  
>>> unique identifiers of feature structures in the CAS. Does your  
>>> implementation provide any stable feature structure IDs,  
>>> preferably ones that are part of the system and not actually  
>>> declared as features?
>> Yes, there are various cases where an 'equivalent' of an FS address  
>> is required (for example if the LL API is being used). In this case  
>> the id gets allocated on the fly and will subsequently be unique to  
>> that FS within the CAS. In many cases an FS might never have such  
>> an ID allocated (it's not really part of the non-LL "public" APIs),  
>> but you can always 'request' one.
> I imagine that IDs would be necessary to implement stuff like  
> delta-CAS later on too.
> Are any of the changes so far in any way related to potentially  
> allowing additions to the type system at runtime?

Not directly related; my goal was just to make the implementation  
functionally equivalent but threadsafe (and simpler, faster).
But it's possible (not certain) this new impl may impose fewer  
barriers to enabling such capability.
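
On the on-demand IDs discussed earlier in this thread, a rough sketch
of lazy allocation could look like the following. Everything here is
an assumption made for illustration (IdSketchFs, IdSketchCas, idOf and
lookup are hypothetical names, not the actual classes): an FS has no
ID until one is requested, and once assigned the ID stays unique to
that FS within the CAS.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical feature structure; id 0 means "no ID allocated yet".
class IdSketchFs {
    volatile int id;

    boolean hasId() {
        return id != 0;
    }
}

// Hypothetical CAS-side registry that allocates IDs only when asked.
class IdSketchCas {
    private final AtomicInteger nextId = new AtomicInteger(1);
    private final ConcurrentHashMap<Integer, IdSketchFs> byId = new ConcurrentHashMap<>();

    // Returns the existing ID, or allocates one on first request.
    int idOf(IdSketchFs fs) {
        synchronized (fs) {
            if (fs.id == 0) {
                int fresh = nextId.getAndIncrement();
                byId.put(fresh, fs);
                fs.id = fresh;
            }
            return fs.id;
        }
    }

    // Resolves an ID back to its FS, or null if it was never allocated.
    IdSketchFs lookup(int id) {
        return byId.get(id);
    }
}
```

One design consequence of this particular sketch: the registry holds a
strong reference, so an FS that has been given an ID stays reachable
until the registry entry is dropped.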

> What would be the incentive/benefit for the developer of a  
> UIMA-based framework/applications or for the users of such  
> frameworks/applications to switch to the new implementation?

That was the "summary of advantages" I had in the original email; I've
included it again below. The primary "external" benefits I think are
the CAS being thread-safe and faster to manipulate. I understand that
many users/developers might not care about these things, just as they  
likely wouldn't care about the code footprint or complexity of the  
internals, but it also shouldn't adversely impact them to "upgrade" to  
a new UIMA version based on this implementation.

I feel that not being able to have more than one thread work on a CAS  
at the same time is a major limitation, especially given modern  
systems typically have many CPU cores.

- Drastic simplification of code - most proprietary data structure  
impls removed, many other classes removed, index/index repo impls are  
about 25% of the size of the heap versions (good for future  
- Thread safety - multiple logically independent annotators can work  
on the same CAS concurrently - reading, writing and iterating over  
feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing  
arrays, no large up-front memory cost for CASes - pooling them should  
no longer be necessary
- Unlike the current heap impl, when an FS is removed from CAS indices
its space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists)  
"is" the feature structure
- Significantly better performance (speed) for many use-cases,  
especially where there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit
more "for free" from ongoing improvements in the Java SDK and from
hardware optimizations targeted at these classes
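
As a small illustration of the thread-safety and standard-collections
points above, the sketch below builds a sorted index on
java.util.concurrent.ConcurrentSkipListSet and fills it from several
threads. It is only an assumption about how such an index could be
built, not the actual index implementation; SpanSketch,
SpanIndexSketch and SpanIndexDemo are hypothetical names.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical annotation-like object with begin/end offsets.
class SpanSketch {
    private static final AtomicLong SEQ = new AtomicLong();
    final int begin;
    final int end;
    // Tie-breaker so two spans with equal offsets can both be indexed.
    final long seq = SEQ.incrementAndGet();

    SpanSketch(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical sorted index: begin ascending, then longer spans first.
class SpanIndexSketch {
    private static final Comparator<SpanSketch> ORDER = (a, b) -> {
        int c = Integer.compare(a.begin, b.begin);
        if (c != 0) return c;
        c = Integer.compare(b.end, a.end);
        if (c != 0) return c;
        return Long.compare(a.seq, b.seq);
    };

    private final ConcurrentSkipListSet<SpanSketch> spans =
            new ConcurrentSkipListSet<>(ORDER);

    void add(SpanSketch s) { spans.add(s); }

    // Once removed, the index no longer references the span.
    boolean remove(SpanSketch s) { return spans.remove(s); }

    int size() { return spans.size(); }

    // Iteration is weakly consistent, so it tolerates concurrent writers.
    Iterable<SpanSketch> inOrder() { return spans; }
}

class SpanIndexDemo {
    // Fills one shared index from several writer threads.
    static SpanIndexSketch fill(int threads, int perThread) {
        SpanIndexSketch index = new SpanIndexSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    index.add(new SpanSketch(base + i, base + i + 1));
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println("indexed spans: " + fill(4, 250).size());
    }
}
```

No external locking or pooled backing array is involved here; the
collection handles concurrent adds, and a removed span becomes eligible
for garbage collection once nothing else refers to it.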

> Cheers,
> -- Richard
