uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Alternate CAS implementation
Date Fri, 03 Apr 2015 20:51:31 GMT
It may be good to open a "Brainstorming" Jira, and attach the code you're
thinking of donating, so that people could study it and have a more concrete
idea about this.

If it is eventually accepted, I think we would also need a Software Grant for
it, given its size.

-Marshall


On 4/2/2015 3:55 PM, Nick Hill wrote:
> Thanks Richard, more replies below...
>
> Quoting Richard Eckart de Castilho <rec@apache.org>:
>
>> Hi Nick,
>>
>> On 02.04.2015, at 01:37, Nick Hill <apache@nickhill.org> wrote:
>>
>>>> From my point of view, it would be nice if it was possible to configure the
>>>> UIMA framework to produce either this new kind of CAS or the old one
>>>> without having to exchange a JAR - doing so statically at initialization
>>>> time or even dynamically at runtime. E.g. to allow easily running test
>>>> cases against both implementations.
>>>
>>> When you say "produce", there shouldn't be any visible difference in
>>> anything output or persisted, the impl is just how the CAS is stored
>>> internally in memory while processing is happening.
>>>
>>> It won't be possible to switch the impl being used at runtime. There are
>>> classes for example with the same names but different impls (e.g. CASImpl).
>>> I know this isn't ideal for tests/comparisons between the two impls but
>>> quite a lot of things are currently tightly-coupled to the heap internals
>>> and so switching a jar doesn't seem too big a price to pay given no other
>>> code changes are needed.
>>
>> What do you plan to be the ultimate goal of this experiment? Is it to support
>> different CAS implementations or is it to replace the existing CAS
>> implementation with a totally different one?
>>
>> Most things in UIMA are created through factories (not the CAS so far). So
>> theoretically, one could replace most classes by custom classes by
>> reconfiguring the framework to use different factory classes or having the
>> factories produce different implementations. Can you imagine that as well for
>> the CAS?
>
> For users the implementation shouldn't matter. They shouldn't observe any
> functional difference and therefore shouldn't really care if the impl changes
> underneath. All consuming code should work as-is, with the exception of code
> which accesses 'internals' directly - but I'd see this as analogous to
> accessing private fields in some java SDK class, which breaks when those
> fields change in a newer SDK version.
>
> As such I don't think it would make sense (or be very practical from a
> maintenance pov) to support two implementations concurrently or to have a
> factory.
>
>> Does it mean that the UIMA-C++ implementation is going to be discontinued
>> officially?
>
> No. Just to clarify, no agreements or plans have been made. I just wanted to
> initiate a discussion around this as a possible idea.
> If we were to pursue this alternate implementation, I don't know of any reason
> why the C++ impl would be discontinued. I had just listed C++ AEs as one of
> the things which don't yet work with my current prototype.
>
>>>> Having to recompile the JCas classes is a bit of a blocker to me - but I
>>>> remember that Marshall was contemplating a way to generate JCas
>>>> classes at runtime, so this might just be a temporary blocker.
>>>
>>> When I say recompile, I don't mean regenerate using JCasGen, just recompile
>>> .class files from the existing jcas .java files. I would expect that you
>>> would typically only be using one version (other than for comparison
>>> purposes - to validate functional equivalence and/or compare performance),
>>> and so this isn't something that would need to be done often.
>>
>> Compiled JCas classes tend to be shipped as part of frameworks. This means
>> that it will not be possible to switch to a new CAS impl just by replacing a
>> JAR. It will also mean that components from different UIMA-based frameworks
>> cannot be mixed and matched anymore unless some broker like UIMA-AS is used.
>
> The current JCas cover class format is quite old and tightly-coupled to the
> heap-based CAS internals. Saying that all new versions of UIMA must be
> binary-compatible with these therefore imposes a (somewhat crippling)
> restriction on possible internal improvements. You might say that the current
> JCas classes break standard abstraction/encapsulation principles if the
> expectation is they will be forever forwards binary-compatible.
>
> It would not be hard on the UIMA side to move to a simpler and more abstract
> JCas cover class format that should avoid this problem in future, but the
> actual move to such a format would be even more disruptive than requiring a
> recompilation (would require a re-JCasGen), and would have the same issues you
> mention above.
>
> I managed to make this object-based impl at least source-compatible with
> existing jcas cover classes by 'converting' the impl of the methods they
> call, which were intended to make CAS heap changes, so that they manipulate
> the FS objects directly instead.
>
>>>> In one context, we also rely heavily on CAS addresses serving as unique
>>>> identifiers of feature structures in the CAS. Does your implementation
>>>> provide any stable feature structure IDs, preferably ones that are part of
>>>> the system and not actually declared as features?
>>>
>>> Yes, there are various cases where an 'equivalent' of an FS address is
>>> required (for example if the LL API is being used). In this case the id gets
>>> allocated on the fly and will subsequently be unique to that FS within the
>>> CAS. In many cases an FS might never have such an ID allocated (it's not
>>> really part of the non-LL "public" APIs), but you can always 'request' one.
>>
>> I imagine that IDs would be necessary to implement stuff like delta-CAS later
>> on too.
>>
>> Are any of the changes so far in any way related to potentially allowing
>> additions to the type system at runtime?
>
> Not directly related; my goal was just to make the implementation functionally
> equivalent but threadsafe (and simpler, faster).
> But it's possible (not certain) this new impl may impose fewer barriers to
> enabling such capability.
>
>> What would be the incentive/benefit for the developer of a UIMA-based
>> framework/applications or for the users of such frameworks/applications to
>> switch to the new implementation?
>
> That was the "summary of advantages" I had in the original email; I've
> included it again below. The primary "external" benefits I think are the CAS
> being thread-safe and faster to manipulate. I understand that many
> users/developers might not care about these things, just as they likely
> wouldn't care about the code footprint or complexity of the internals, but it
> also shouldn't adversely impact them to "upgrade" to a new UIMA version based
> on this implementation.
>
> I feel that not being able to have more than one thread work on a CAS at the
> same time is a major limitation, especially given modern systems typically
> have many CPU cores.
>
> - Drastic simplification of code - most proprietary data structure impls
> removed, many other classes removed, index/index repo impls are about 25% of
> the size of the heap versions (good for future enhancements/maintainability)
> - Thread safety - multiple logically independent annotators can work on the
> same CAS concurrently - reading, writing and iterating over feature
> structures. Opens up a lot of parallelism possibilities
> - No need for heap resizing or wasted space in fixed size CAS backing arrays,
> no large up-front memory cost for CASes - pooling them should no longer be
> necessary
> - Unlike the current heap impl, when an FS is removed from CAS indices its
> space is actually freed (can be GC'd)
> - Unification of CAS and JCas - cover class instance (if it exists) "is" the
> feature structure
> - Significantly better performance (speed) for many use-cases, especially
> where there is heavy access of CAS data
> - Usage of standard Java data structure classes means it can benefit more "for
> free" from ongoing improvements in the java SDK and from hardware
> optimizations targeted at these classes
>
>>
>> Cheers,
>>
>> -- Richard
>
>
>
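
The on-demand ID allocation Nick describes in the quoted message (an FS has no
address until one is requested, and the ID is then unique to that FS within
the CAS) could be sketched roughly as below. This is only an illustration under
assumptions: LazyIdSketch, Cas, FeatureStructure, idOf, lookup and
allocatedCount are hypothetical names, not the UIMA API and not the prototype
discussed in the thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names throughout; this is not the UIMA API.
public class LazyIdSketch {

    /** Stand-in for an object-based feature structure (identity equality). */
    static final class FeatureStructure {
        final String type;
        FeatureStructure(String type) { this.type = type; }
    }

    /** Stand-in for a CAS that hands out IDs only when one is requested. */
    static final class Cas {
        private final AtomicInteger nextId = new AtomicInteger(1);
        private final Map<FeatureStructure, Integer> idsByFs = new ConcurrentHashMap<>();
        private final Map<Integer, FeatureStructure> fsById = new ConcurrentHashMap<>();

        /** Returns the ID of fs, allocating one on first request. */
        int idOf(FeatureStructure fs) {
            // computeIfAbsent runs the allocation at most once per key,
            // even when several threads ask for the same FS concurrently.
            return idsByFs.computeIfAbsent(fs, key -> {
                int id = nextId.getAndIncrement();
                fsById.put(id, key);
                return id;
            });
        }

        /** Resolves a previously allocated ID, or null if none was allocated. */
        FeatureStructure lookup(int id) {
            return fsById.get(id);
        }

        /** Number of feature structures that have had an ID allocated. */
        int allocatedCount() {
            return idsByFs.size();
        }
    }

    public static void main(String[] args) {
        Cas cas = new Cas();
        FeatureStructure token = new FeatureStructure("Token");
        FeatureStructure sentence = new FeatureStructure("Sentence");

        // Nothing has an ID until somebody asks for one.
        System.out.println("allocated before request: " + cas.allocatedCount());

        int tokenId = cas.idOf(token);
        System.out.println("stable on repeat: " + (tokenId == cas.idOf(token)));
        System.out.println("distinct per FS: " + (tokenId != cas.idOf(sentence)));
        System.out.println("resolves back: " + (cas.lookup(tokenId) == token));
    }
}
```

One design note on the sketch: the maps hold strong references, so an FS that
has been given an ID would not be garbage collected while the CAS is alive. A
real implementation would have to decide how that interacts with freeing
removed feature structures.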


