uima-dev mailing list archives

From Nick Hill <apa...@nickhill.org>
Subject Re: Alternate CAS implementation
Date Mon, 06 Apr 2015 00:04:39 GMT
First I just want to emphasize that this proposal doesn't necessarily  
have full endorsement yet from Marshall, Eddie, et al. so I won't  
comment on the strategic roadmap questions. I'm just attempting to  
make the case for a direction which to me seems very natural.

Logically, a UIMA CAS is an object graph with some fixed type system, rooted
in one or more collections with different ordering/uniqueness rules.
These are all things Java provides out of the box, with a highly
evolved heap/GC engine and numerous powerful SDK collection classes.
Thus I would argue the simplest and most "obvious" way to implement the
UIMA specification would be a thin layer on top of these existing
constructs, and there would need to be quite a compelling reason to
deviate significantly from this.
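
As a rough illustration of that "thin layer" idea, here is a minimal Java
sketch. It is not the proposed implementation and uses no UIMA APIs; every
class, field, and method name below is hypothetical. Nodes of the graph are
ordinary objects, references between them are ordinary Java references, and
the "indexes" are standard collections with different ordering/uniqueness
rules.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical node type; stands in for a feature structure of a fixed type.
class SketchAnnotation {
    final String type;
    final int begin;
    final int end;
    SketchAnnotation covering; // plain reference = edge in the object graph

    SketchAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical container: two "indexes" backed by JDK collections.
class SketchCas {
    // Order by begin ascending, then end descending, then type name.
    static final Comparator<SketchAnnotation> SPAN_ORDER = (a, b) -> {
        if (a.begin != b.begin) return Integer.compare(a.begin, b.begin);
        if (a.end != b.end) return Integer.compare(b.end, a.end);
        return a.type.compareTo(b.type);
    };

    // Sorted index with set semantics: entries equal under SPAN_ORDER collapse.
    private final TreeSet<SketchAnnotation> sortedSet = new TreeSet<>(SPAN_ORDER);
    // Bag index: keeps everything, in insertion order.
    private final List<SketchAnnotation> bag = new ArrayList<>();

    void add(SketchAnnotation a) {
        sortedSet.add(a);
        bag.add(a);
    }

    List<SketchAnnotation> sorted() {
        return new ArrayList<>(sortedSet);
    }

    int bagSize() {
        return bag.size();
    }
}
```

Anything no longer reachable from these collections is reclaimed by the
normal garbage collector, with no extra bookkeeping in the sketch itself.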

I completely understand that in the past such reasons existed, but  
it's now 10 years later and JVMs and hardware have moved on  
considerably. I'm fairly certain that those reasons are no longer  
valid today, and yet we are still paying the cost of significant  
complexity for something with more limitations than we would have  
otherwise. Carving out static chunks of heap and doing a
proprietary form of "memory management" within them prevents the
JVM from optimizing how and where this data is stored and collected. This
works directly against how the JVM was designed to be used.
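
The contrast can be shown with a deliberately simplified toy. This is not the
actual CAS heap layout, and all names are hypothetical; it only shows the
general pattern of packing records into a preallocated int array and
addressing them by offset, next to the equivalent ordinary object.

```java
import java.util.Arrays;

// Toy only: records packed into one int[] and addressed by offset.
class ToyArrayHeap {
    private int[] heap = new int[1024]; // preallocated chunk
    private int next = 1;               // offset 0 is reserved as "null"

    // Record layout (assumed for this toy): [typeCode, begin, end]
    int allocate(int typeCode, int begin, int end) {
        if (next + 3 > heap.length) {
            heap = Arrays.copyOf(heap, heap.length * 2); // grow by copying
        }
        int addr = next;
        heap[addr] = typeCode;
        heap[addr + 1] = begin;
        heap[addr + 2] = end;
        next += 3;
        return addr;
    }

    int begin(int addr) { return heap[addr + 1]; }

    int end(int addr) { return heap[addr + 2]; }

    // In this toy, single records are never freed; only a full reset
    // makes the space reusable, and the GC sees just one big array.
    void reset() { next = 1; }

    int usedCells() { return next - 1; }
}

// The same data as an ordinary object, individually visible to the GC.
class ToySpan {
    final int begin;
    final int end;

    ToySpan(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}
```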

The standard UIMA (non-JCas) CAS APIs allow type-system-agnostic
CAS manipulation, so my understanding is that the only reason for
performing low-level or direct heap access is to get better
performance out of the current array-based impl. I was assuming such
usage would be rare among users of UIMA in general, but I could be wrong
and it's useful to know people out there like you are doing it. I'd
argue, though, that the fact it is even necessary is another reason for
changing the approach (given that a goal of UIMA is to minimize the effort
required of NLP developers). Could you elaborate on your usage of
internal APIs?

Regarding serialization formats, each of them is just a well-defined
serial representation of a CAS, so should not be affected by the
runtime implementation. I do understand that the current binary formats
derive from the CAS array internals, but that doesn't mean that this new
impl couldn't read/write that same format. I expect that here specifically
there may be a relative performance impact because of the 'reconstruction'
of the heaps that would be needed, however:
- In a way, keeping the CAS in this form in memory could be seen as
optimizing for speed of this specific binary format at the expense of
slower and less flexible runtime CAS access.
- Alternative binary serialization mechanisms (and formats), similar in
spirit to standard Java object serialization, could also be used, and I
expect they would be just as fast (though not default Java serialization
itself, which is very inefficient).
- I'd question in any case whether this alone should dictate the
overall architecture choice.
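
To illustrate the point that a serial format is separable from the runtime
representation, here is a minimal sketch of a compact binary round trip over
plain objects using the JDK's DataOutputStream/DataInputStream. The format
and all names are invented for this example; it is not any of the existing
UIMA binary formats and says nothing about their relative speed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory node, held as an ordinary object.
class SerSpan {
    final String type;
    final int begin;
    final int end;

    SerSpan(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Invented format: [count][type as UTF, begin, end]* written as raw fields.
class SketchBinaryCodec {
    static byte[] write(List<SerSpan> spans) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(spans.size());
            for (SerSpan s : spans) {
                out.writeUTF(s.type);
                out.writeInt(s.begin);
                out.writeInt(s.end);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<SerSpan> read(byte[] data) {
        List<SerSpan> spans = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                // Java evaluates arguments left to right, matching write order.
                spans.add(new SerSpan(in.readUTF(), in.readInt(), in.readInt()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return spans;
    }
}
```

The objects are flattened only at the moment of writing and rebuilt at the
moment of reading; nothing about the byte layout constrains how they are
held in memory in between.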

> It was my impression in the past that UIMA-Core has always valued
> compatibility very highly, even to the point of adding switches to
> re-enable buggy/undesired behavior in case somebody depended on it.

I understand this, and I think I managed to keep things functionally  
identical. I'm not proposing any change in behaviour.

> Changing the implementation of the CAS is probably the most radical  
> idea I've seen so far in this project.

It might be radical in terms of the implementation change but again I  
would argue it's really just a simplification of the internals. It  
shouldn't be radical at all for users of UIMA in general.

> Are we going to slay the holy cow of compatibility now and if yes at  
> which levels?

Even for this change, the source incompatibility only applies to JCas  
cover classes, and only to those because of their current  
implementation-specific format.

> What does such a change mean to the various sub-projects (DUCC,  
> UIMA-AS, RUTA, uimaFIT)?

As long as these projects don't directly manipulate the CAS arrays,
there should be zero impact on them apart from, I would hope, some
performance benefits. It would also mean that in future they could exploit
the thread-safe nature of the CAS for various purposes.

Regards,
Nick

Quoting Richard Eckart de Castilho <rec@apache.org>:

> On 03.04.2015, at 22:51, Marshall Schor <msa@schor.com> wrote:
>
>> It may be good to open a "Brainstorming" Jira, and attach the code you're
>> thinking of donating, so that people could study it and have a more concrete
>> idea about this.
>>
>> If it eventually gets accepted, we would also need a Software Grant
>> for this, I think, due to the size.
>
> It was my impression in the past that UIMA-Core has always valued
> compatibility very highly, even to the point of adding switches to
> re-enable buggy/undesired behavior in case somebody depended on it.
> Changing the implementation of the CAS is probably the most radical
> idea I've seen so far in this project. In principle, I very much
> like seeing UIMA evolve, but I do wonder how such a radical
> change is imagined to be undertaken.
>
> I'm aware that there are various levels of compatibility. My  
> impression so far was that source-compatibility was typically not  
> sufficient in the past.
>
> Are we going to slay the holy cow of compatibility now and if yes at  
> which levels?
>
> Is there some willingness now to consider setting up a road-map for  
> a UIMA-Core version 3?
>
> What does such a change mean to the various sub-projects (DUCC,  
> UIMA-AS, RUTA, uimaFIT)?
>
> Personally, I'd be curious to see how much of e.g. DKPro Core or  
> WebAnno breaks with such a new implementation. I imagine quite a lot  
> since I've become quite fond of binary serialization and internal  
> API usage lately (in some cases I might be able to switch to  
> official low-level CAS API...). Although I'm very much for evolution  
> and adopting newer technologies, I'm afraid testing this (and  
> potentially fixing stuff) will be quite time intensive. Given that  
> in my context, most of the benefits are not very relevant so far,  
> such testing would only make sense to me if it was part of a larger  
> strategic change - and I think that a properly licensed contribution  
> would be pretty much a pre-requisite to even look at it in detail.
>
> Marshall, Eddie, and Nick do you have some vision of a strategic  
> UIMA roadmap that you can share with us?
>
> Cheers,
>
> -- Richard


