uima-dev mailing list archives

From Nick Hill <apa...@nickhill.org>
Subject Re: Alternate CAS implementation
Date Tue, 07 Apr 2015 22:27:35 GMT
Of course, I'm in the same boat! :-)

Regards,
Nick

Quoting Marshall Schor <msa@schor.com>:

> OK, thanks!  Please be patient - volunteers at work (who have other "day"
> jobs :-) )
>
> -Marshall
>
> On 4/7/2015 4:18 PM, Nick Hill wrote:
>> Per Marshall's suggestion I've created a "brainstorming" jira issue and
>> attached the current prototype code:
>>
>> https://issues.apache.org/jira/browse/UIMA-4329
>>
>> Regards,
>> Nick
>>
>> Quoting Peter Klügl <peter.kluegl@averbis.com>:
>>
>>> Hi Nick,
>>>
>>> I am (of course) also interested in the alternate CAS implementation.
>>>
>>> I agree with Marshall that the code should be attached to a Jira issue so
>>> that we can take a closer look and investigate its impact on the various
>>> tools and libraries (in my case UIMA Ruta).
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> On 06.04.2015 at 02:04, Nick Hill wrote:
>>>> First I just want to emphasize that this proposal doesn't necessarily have
>>>> full endorsement yet from Marshall, Eddie, et al. so I won't comment on the
>>>> strategic roadmap questions. I'm just attempting to make the case for a
>>>> direction which to me seems very natural.
>>>>
>>>> Logically UIMA is an object graph with some fixed type system, rooted in one
>>>> or more collections with different ordering/uniqueness rules. These are all
>>>> things Java provides out of the box, with a highly evolved heap/GC engine
>>>> and numerous powerful SDK collection classes. Thus I would argue the
>>>> simplest and most "obvious" way to implement the UIMA specification would be
>>>> a thin layer on top of these existing constructs, and there would need to be
>>>> quite a compelling reason to deviate significantly from this.
>>>>
>>>> I completely understand that in the past such reasons existed, but it's now
>>>> 10 years later and JVMs and hardware have moved on considerably. I'm fairly
>>>> certain that those reasons are no longer valid today, and yet we are still
>>>> paying the cost of significant complexity for something with more
>>>> limitations than we would have otherwise. Carving out static chunks of
>>>> heap and doing a proprietary form of "memory management" within them
>>>> prevents the JVM from optimizing how/where this data is stored and
>>>> collected. This is very much working against how the JVM was designed to be
>>>> used.
>>>>
>>>> The standard UIMA (non-JCas) CAS APIs allow type-system-agnostic CAS
>>>> manipulation, so my understanding is that the only reason for performing
>>>> low-level or direct heap access is to get better performance out of the
>>>> current array-based impl. I was assuming such usage would be rare among
>>>> users of UIMA in general, but I could be wrong and it's useful to know
>>>> people out there like you are doing it. I'd argue though that the fact that
>>>> it is even necessary is another reason for changing the approach (given that
>>>> a goal of UIMA is to minimize effort required by NLP developers). Could you
>>>> elaborate on your usage of internal APIs?
>>>>
>>>> Regarding serialization formats, each of them is just a well-defined serial
>>>> representation of a CAS, so should not be affected by the runtime
>>>> implementation.
>>>> I do understand that the current binary formats derive from the CAS array
>>>> internals, but it doesn't mean that this new impl couldn't read/write that
>>>> same format. I expect here specifically there may be a relative performance
>>>> impact because of the 'reconstruction' of the heaps that would be needed,
>>>> however:
>>>> - In a way, keeping the CAS in this form in memory could be seen as
>>>> optimizing for speed of this specific binary format at the expense of slower
>>>> and less flexible runtime CAS access
>>>> - Alternative binary serialization mechanisms (and formats) could also be
>>>> used, similar to standard Java object serialization, which I expect would be
>>>> just as fast (although not default Java serialization, which is very
>>>> inefficient)
>>>> - I'd question in any case whether this alone should dictate the overall
>>>> architecture choice
>>>>
>>>>> It was my impression in the past that UIMA-Core has always valued
>>>>> compatibility very highly, even to the point of adding switches to re-enable
>>>>> buggy/undesired behavior in case somebody depended on it.
>>>>
>>>> I understand this, and I think I managed to keep things functionally
>>>> identical. I'm not proposing any change in behaviour.
>>>>
>>>>> Changing the implementation of the CAS is probably the most radical idea
>>>>> I've seen so far in this project.
>>>>
>>>> It might be radical in terms of the implementation change but again I would
>>>> argue it's really just a simplification of the internals. It shouldn't be
>>>> radical at all for users of UIMA in general.
>>>>
>>>>> Are we going to slay the holy cow of compatibility now and if yes at which
>>>>> levels?
>>>>
>>>> Even for this change, the source incompatibility only applies to JCas cover
>>>> classes, and only to those because of their current implementation-specific
>>>> format.
>>>>
>>>>> What does such a change mean to the various sub-projects (DUCC, UIMA-AS,
>>>>> RUTA, uimaFIT)?
>>>>
>>>> As long as these projects don't directly manipulate the CAS arrays, there
>>>> should be zero impact on them apart from, I would hope, some performance
>>>> benefits. It would also mean that in future they could exploit the
>>>> thread-safe nature of the CAS for various purposes.
>>>>
>>>> Regards,
>>>> Nick
>>>>
>>>> Quoting Richard Eckart de Castilho <rec@apache.org>:
>>>>
>>>>> On 03.04.2015, at 22:51, Marshall Schor <msa@schor.com> wrote:
>>>>>
>>>>>> It may be good to open a "Brainstorming" Jira, and attach the code you're
>>>>>> thinking of donating, so that people could study it and have a more
>>>>>> concrete idea about this.
>>>>>>
>>>>>> If it eventually gets accepted, we would also need a Software Grant for
>>>>>> this, I think, due to the size.
>>>>>
>>>>> It was my impression in the past that UIMA-Core has always valued
>>>>> compatibility very highly, even to the point of adding switches to re-enable
>>>>> buggy/undesired behavior in case somebody depended on it. Changing the
>>>>> implementation of the CAS is probably the most radical idea I've seen so
>>>>> far in this project. In principle, I very much like seeing UIMA evolve,
>>>>> but I do wonder how such a radical change is imagined to be undertaken.
>>>>>
>>>>> I'm aware that there are various levels of compatibility. My impression so
>>>>> far was that source-compatibility was typically not sufficient in the past.
>>>>>
>>>>> Are we going to slay the holy cow of compatibility now and if yes at which
>>>>> levels?
>>>>>
>>>>> Is there some willingness now to consider setting up a road-map for a
>>>>> UIMA-Core version 3?
>>>>>
>>>>> What does such a change mean to the various sub-projects (DUCC, UIMA-AS,
>>>>> RUTA, uimaFIT)?
>>>>>
>>>>> Personally, I'd be curious to see how much of e.g. DKPro Core or WebAnno
>>>>> breaks with such a new implementation. I imagine quite a lot since I've
>>>>> become quite fond of binary serialization and internal API usage lately (in
>>>>> some cases I might be able to switch to official low-level CAS API...).
>>>>> Although I'm very much for evolution and adopting newer technologies, I'm
>>>>> afraid testing this (and potentially fixing stuff) will be quite time
>>>>> intensive. Given that in my context, most of the benefits are not very
>>>>> relevant so far, such testing would only make sense to me if it was part of
>>>>> a larger strategic change - and I think that a properly licensed
>>>>> contribution would be pretty much a pre-requisite to even look at it in
>>>>> detail.
>>>>>
>>>>> Marshall, Eddie, and Nick do you have some vision of a strategic UIMA
>>>>> roadmap that you can share with us?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -- Richard
>>>>
>>
>>
>>


