uima-dev mailing list archives

From "Richard Eckart de Castilho (JIRA)" <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-4329) Object-based CAS implementation proposal/prototype
Date Thu, 09 Apr 2015 07:19:12 GMT

    [ https://issues.apache.org/jira/browse/UIMA-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486883#comment-14486883 ]

Richard Eckart de Castilho commented on UIMA-4329:

I'm fooling around a bit with the alternative uimaj-core.

For my purposes, I changed the pom.xml in your file so that the artifactId is "uimaj-core"
and the version is "2.7.1-nick-snapshot". That makes it easier for me to update just the core
JAR via Maven dependency management mechanisms for multi-module projects.

It appears there may be changes in the file that do not pertain to the alternative CAS. One
test does not compile because an "AnalysisEngineManagementImpl.getRootName("bar")" method is

In WebAnno [1], we heavily rely on LowLevelCas, CAS addresses, and binary serialization in
all forms. That shows when dropping in the alternative CAS impl:

* the method getAddress() is undefined for <JCas class> - we need a stable identifier
for feature structures that also can remain stable across serialization (for this we currently
rely on CASCompleteSerializer, because binary form 6 doesn't keep the addresses stable)
* the method getLowLevelCas() is undefined for type JCas - we use this as an alternative way
to access/resolve CAS addresses
* import org.apache.uima.cas.impl.CASCompleteSerializer cannot be resolved - used as a fast
serialization that maintains CAS addresses
* import org.apache.uima.cas.impl.Serialization cannot be resolved - used as a fast serialization
that maintains CAS addresses

We decided to use the CAS addresses because they offer a fast and convenient way of random
access to feature structures in the CAS and because we didn't have to touch the type
system. In this way, the type system can be kept free of WebAnno-specific information.

If we wanted to switch WebAnno to the new CAS implementation, we'd need

* some way of uniquely identifying FSes even across serialization. A short ID like an integer
would be convenient (i.e. no UUID)
* a fast (de-)serialization of the CAS

I understand that you consider adding both of these features anyway.
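
To make the two requirements concrete, here is a minimal sketch in plain Java of what "short integer IDs that survive serialization" could look like. It is an illustration only: StableIdStore and all of its methods are hypothetical names, a String stands in for a feature structure, and nothing here is part of UIMA or of the attached prototype.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: none of these names exist in UIMA.
final class StableIdStore {
    // id -> payload; a String stands in for a feature structure.
    private final Map<Integer, String> byId = new LinkedHashMap<>();
    // Next id to hand out; persisted so ids are never reused after a reload.
    private int nextId = 1;

    /** Adds an entry and returns its newly assigned integer id. */
    int add(String payload) {
        int id = nextId++;
        byId.put(id, payload);
        return id;
    }

    /** Random access by id; returns null if the id is unknown. */
    String get(int id) {
        return byId.get(id);
    }

    /** Removes an entry; its id is not handed out again. */
    void remove(int id) {
        byId.remove(id);
    }

    /** Writes the id counter and every (id, payload) pair. */
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(nextId);
            out.writeInt(byId.size());
            for (Map.Entry<Integer, String> e : byId.entrySet()) {
                out.writeInt(e.getKey());
                out.writeUTF(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Rebuilds a store in which every entry keeps the id it had before. */
    static StableIdStore fromBytes(byte[] data) {
        StableIdStore store = new StableIdStore();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            store.nextId = in.readInt();
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int id = in.readInt();
                store.byId.put(id, in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return store;
    }
}
```

The point of the sketch is that the ID is stored explicitly next to the data rather than derived from a heap position, so a round trip through toBytes() and fromBytes() cannot renumber anything, and the type system does not need an extra ID feature.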

As an upgrade path for our users, we could provide a command line tool to convert all data
to XMI and then a second tool to convert the XMI to a new fast binary serialization format.
It would be more convenient of course if both CAS implementations could co-exist in the same
JVM because then we wouldn't need two tools for conversion (ok, we could do classloader magic
to work around this and actually load two instances of the framework but that's also not the
most trivial approach...).

Next I'll look at DKPro Core.

[1] http://webanno.googlecode.com

> Object-based CAS implementation proposal/prototype
> --------------------------------------------------
>                 Key: UIMA-4329
>                 URL: https://issues.apache.org/jira/browse/UIMA-4329
>             Project: UIMA
>          Issue Type: Brainstorming
>          Components: Core Java Framework
>            Reporter: Nick Hill
>            Priority: Minor
>         Attachments: uima-core_obj-0.2.jar, uimaj-core_obj-0.2.tar.gz
> I have been experimenting with a simplified CAS implementation where each feature structure
is an object and the indices are based on standard Java SDK concurrent collection classes.
This replaces the complex custom array-based heaps and index implementations.
> The primary motivation was to make the CAS threadsafe so that multiple annotators could
process one concurrently, but I think there are a number of other benefits.
> Summary of advantages:
> - Drastic simplification of code - most proprietary data structure impls removed, many
other classes removed, index/index repo impls are about 25% of the size of the heap versions
(good for future enhancements/maintainability)
> - Thread safety - multiple logically independent annotators can work on the same CAS
concurrently - reading, writing and iterating over feature structures. Opens up a lot of parallelism
> - No need for heap resizing or wasted space in fixed size CAS backing arrays, no large
up-front memory cost for CASes - pooling them should no longer be necessary
> - Unlike the current heap impl, when a FS is removed from CAS indices its space is actually
freed (can be GC'd)
> - Unification of CAS and JCas - cover class instance (if it exists) "is" the feature
> - Significantly better performance (speed) for many use-cases, especially where there
is heavy access of CAS data
> - Usage of standard Java data structure classes means it can benefit more "for free"
from ongoing improvements in the java SDK and from hardware optimizations targeted at these
> I was hoping to see if there's interest from the community in taking this further, maybe
even as a replacement for the current impl in a future version of uima-core. There has already
been some discussion on the mailing list under the subject "Alternate CAS implementation".
> I'm attaching the current prototype, which should support most existing UIMA functionality
with the exception of:
> - Binary serialization/deserialization
> - C/C++ framework (requires binary serialization)
> - "Delta" CAS related function including CAS markers
> - Index "auto protection" (recent 2.7 feature)
> Note I don't mean to imply these things can't be supported, just that they aren't yet.
> Where these things aren't used it should be possible to try out the attached uima-core.jar
as a drop-in replacement with existing apps/frameworks. An important caveat though is that
any existing JCas cover classes will need recompiling with the new jar (but not re-JCasGenning).
> I'll also attach the code. I started by basically ripping out the CAS heaps, so there's
a lot of code which is just commented out (e.g. in CASImpl.java). Lots of cleanup/tidyup is
still needed, and there are various places which still need fixing for threadsafety (e.g. synchronization
around some existing create-on-first-access logic.. this is separate to the indices though).
But those things shouldn't affect existing usage. A convention I followed was not to rename
modified classes (e.g. CASImpl), but where an equivalent impl was created from scratch I did
give it a new name starting with "CC" (e.g. FeatureStructureImpl is now CCFeatureStructure).
The cc stood for "concurrent CAS". I have kept it in sync with the latest compatible changes
in the uima-core stream, apart from those related to the non-impl'd functions mentioned above.
> Most of the "valid" unit tests work. Some are tied to the internals and no longer apply,
many don't compile because they use binary serialization and/or delta CAS related classes
which I removed for the time being. Some others I had to generalize a bit because for example
they assumed a specific order in places where the order should be arbitrary, and maybe some
other similar reasons.
> md5 checksums:
> {{94499c8f18f832fd1ded9106c64e8c76 *uima-core_obj-0.2.jar}}
> {{0cac18e89c616a8270e810f34b6468ad *uimaj-core_obj-0.2.tar.gz}}
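
As a rough illustration of the idea in the quoted description (feature structures as plain objects, indices built on standard Java concurrent collections), the following sketch may help. It is not taken from the attached prototype: Span, SpanIndex and fillConcurrently are hypothetical names, and the ordering (begin ascending, then end descending) is only an assumption modelled on a typical annotation index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an annotation feature structure; not a UIMA class.
final class Span {
    private static final AtomicInteger NEXT_ID = new AtomicInteger();
    final int id = NEXT_ID.incrementAndGet(); // unique tie-breaker for the index
    final int begin;
    final int end;
    final String type;

    Span(int begin, int end, String type) {
        this.begin = begin;
        this.end = end;
        this.type = type;
    }
}

// Hypothetical sorted index backed by a standard concurrent collection.
final class SpanIndex {
    // begin ascending, then end descending (longer spans first), then type, then id.
    private static final Comparator<Span> ORDER = (a, b) -> {
        if (a.begin != b.begin) return Integer.compare(a.begin, b.begin);
        if (a.end != b.end) return Integer.compare(b.end, a.end);
        int byType = a.type.compareTo(b.type);
        return byType != 0 ? byType : Integer.compare(a.id, b.id);
    };

    private final ConcurrentSkipListSet<Span> spans = new ConcurrentSkipListSet<>(ORDER);

    void add(Span s) { spans.add(s); }

    // Once removed, the object is unreferenced by the index and can be garbage collected.
    void remove(Span s) { spans.remove(s); }

    int size() { return spans.size(); }

    /** Copies the current contents in index order. */
    List<Span> snapshot() { return new ArrayList<>(spans); }

    /** Lets several threads add to one index without external locking. */
    static SpanIndex fillConcurrently(int threads, int perThread) {
        SpanIndex index = new SpanIndex();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    index.add(new Span(i, i + 1 + offset, "Token"));
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while filling index", e);
            }
        }
        return index;
    }
}
```

For example, SpanIndex.fillConcurrently(4, 250) has four threads insert 250 spans each into the same index; the skip list keeps them sorted without any explicit synchronization in the calling code.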

This message was sent by Atlassian JIRA
