uima-dev mailing list archives

From "Nick Hill (JIRA)" <...@uima.apache.org>
Subject [jira] [Updated] (UIMA-4329) Object-based CAS implementation proposal/prototype
Date Thu, 09 Apr 2015 06:35:12 GMT

     [ https://issues.apache.org/jira/browse/UIMA-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Hill updated UIMA-4329:
----------------------------
    Description: 
I have been experimenting with a simplified CAS implementation where each feature structure
is an object and the indices are based on standard Java SDK concurrent collection classes.
This replaces the complex custom array-based heaps and index implementations.
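
As a rough illustration of the idea (not the prototype's actual code), a feature structure can be an ordinary Java object and a sorted index can be a thin wrapper around java.util.concurrent.ConcurrentSkipListSet. The class names SketchAnnotation and SketchIndex, the begin/end fields and the sort order are all assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a feature structure held as a plain Java object. */
final class SketchAnnotation {
    private static final AtomicInteger NEXT_ID = new AtomicInteger();

    final int id = NEXT_ID.incrementAndGet(); // tie-breaker: distinct objects never compare as equal
    final int begin;
    final int end;

    SketchAnnotation(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical sorted index backed by a standard concurrent collection. */
final class SketchIndex {
    // Assumed order: begin ascending, then end descending, then creation id.
    private static final Comparator<SketchAnnotation> ORDER = (a, b) -> {
        int c = Integer.compare(a.begin, b.begin);
        if (c == 0) c = Integer.compare(b.end, a.end);
        if (c == 0) c = Integer.compare(a.id, b.id);
        return c;
    };

    private final ConcurrentSkipListSet<SketchAnnotation> entries =
        new ConcurrentSkipListSet<>(ORDER);

    void add(SketchAnnotation fs) { entries.add(fs); }

    // After removal the index no longer references the object, so it can be
    // garbage collected once nothing else refers to it.
    boolean remove(SketchAnnotation fs) { return entries.remove(fs); }

    int size() { return entries.size(); }

    /** Lists the indexed spans in index order, separated by spaces. */
    String describe() {
        List<String> spans = new ArrayList<>();
        for (SketchAnnotation fs : entries) spans.add(fs.begin + "-" + fs.end);
        return String.join(" ", spans);
    }
}
```

There is no backing array to size or grow in this shape; the collection allocates per entry, which is the trade-off the description is pointing at.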

The primary motivation was to make the CAS threadsafe so that multiple annotators could process
one concurrently, but I think there are a number of other benefits.

Summary of advantages:
- Drastic simplification of code - most proprietary data structure impls removed, many other
classes removed, index/index repo impls are about 25% of the size of the heap versions (good
for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the same CAS concurrently:
reading, writing and iterating over feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing arrays, no large up-front
memory cost for CASes - pooling them should no longer be necessary
- Unlike the current heap impl, when a FS is removed from CAS indices its space is actually
freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists) "is" the feature structure
- Significantly better performance (speed) for many use-cases, especially where there is heavy
access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for free" from
ongoing improvements in the Java SDK and from hardware optimizations targeted at these classes
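
The thread-safety point can be sketched with nothing but JDK classes. In the example below, several writer tasks add entries to a shared ConcurrentSkipListSet while a reader task iterates over it; the iterators of that class are weakly consistent and do not throw ConcurrentModificationException. The class ConcurrentIndexDemo and its integer keys are hypothetical stand-ins for annotators and feature structures, not part of the prototype.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConcurrentIndexDemo {
    /** Runs the writers and one reader concurrently and returns the final index size. */
    static int run(int writers, int perWriter) {
        ConcurrentSkipListSet<Integer> index = new ConcurrentSkipListSet<>();
        ExecutorService pool = Executors.newFixedThreadPool(writers + 1);
        List<Future<?>> tasks = new ArrayList<>();
        try {
            for (int w = 0; w < writers; w++) {
                final int base = w * perWriter; // each writer owns a distinct key range
                tasks.add(pool.submit(() -> {
                    for (int i = 0; i < perWriter; i++) {
                        index.add(base + i);
                    }
                }));
            }
            // Reader: iterates repeatedly while the writers are still adding entries.
            tasks.add(pool.submit(() -> {
                for (int pass = 0; pass < 50; pass++) {
                    int seen = 0;
                    for (int key : index) {
                        seen++;
                    }
                }
            }));
            // Future.get() rethrows any failure from a task, including the reader.
            for (Future<?> task : tasks) {
                task.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("demo task failed", e);
        } finally {
            pool.shutdown();
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000)); // 4 writers x 1000 distinct keys each
    }
}
```

This only shows that the collection itself tolerates concurrent use; whether annotators are logically independent is still up to the application.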

I was hoping to see if there's interest from the community in taking this further, maybe even
as a replacement for the current impl in a future version of uima-core. There has already
been some discussion on the mailing list under the subject "Alternate CAS implementation".

I'm attaching the current prototype, which should support most existing UIMA functionality
with the exception of:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related function including CAS markers
- Index "auto protection" (recent 2.7 feature)

Note I don't mean to imply these things can't be supported, just that they aren't yet.

Where these things aren't used it should be possible to try out the attached uima-core.jar
as a drop-in replacement with existing apps/frameworks. An important caveat though is that
any existing JCas cover classes will need recompiling with the new jar (but not re-JCasGenning).

I'll also attach the code. I started by basically ripping out the CAS heaps, so there's a
lot of code which is just commented out (e.g. in CASImpl.java). Lots of cleanup/tidyup is
still needed, and there are various places which still need fixing for thread safety (e.g. synchronization
around some existing create-on-first-access logic; this is separate from the indices, though).
But those things shouldn't affect existing usage. A convention I followed was not to rename
modified classes (e.g. CASImpl), but where an equivalent impl was created from scratch I did
give it a new name starting with "CC" (e.g. FeatureStructureImpl is now CCFeatureStructure).
The cc stood for "concurrent CAS". I have kept it in sync with the latest compatible changes
in the uima-core stream, apart from those related to the non-impl'd functions mentioned above.
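
One common way to make create-on-first-access logic safe, shown here only as a sketch and not as what the prototype does, is to let ConcurrentHashMap.computeIfAbsent perform the check and the creation atomically. LazyRegistry, its string keys and the creations counter are hypothetical names used for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class LazyRegistry {
    private final ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
    final AtomicInteger creations = new AtomicInteger(); // counts how often the factory ran

    /** Returns the object for this name, creating it on first access. */
    Object get(String name) {
        // computeIfAbsent applies the function at most once per key,
        // even when several threads ask for the same key at the same time.
        return cache.computeIfAbsent(name, key -> {
            creations.incrementAndGet();
            return new Object();
        });
    }

    /** Has several threads request the same key and reports how many objects were created. */
    static int creationsUnderContention(int threads) {
        LazyRegistry registry = new LazyRegistry();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> registry.get("shared"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return registry.creations.get();
    }

    public static void main(String[] args) {
        System.out.println(creationsUnderContention(8));
    }
}
```

An unsynchronized "if null then create" check can instead create the object twice under contention, which is the kind of spot the paragraph above refers to.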

Most of the "valid" unit tests work. Some are tied to the internals and no longer apply; many
don't compile because they use binary serialization and/or delta CAS related classes which
I removed for the time being. Some others I had to generalize a bit because, for example, they
assumed a specific order in places where the order should be arbitrary, or for other
similar reasons.
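
To illustrate what generalizing an order-dependent test can look like, the sketch below compares two result lists as multisets, so the check passes whatever order the elements come back in. UnorderedAssert and sameElements are hypothetical helper names; the actual test changes in the prototype may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UnorderedAssert {
    /** True when both lists hold the same elements the same number of times, in any order. */
    static <T> boolean sameElements(List<T> expected, List<T> actual) {
        return counts(expected).equals(counts(actual));
    }

    // Builds an element -> occurrence count map, so duplicates are still checked.
    private static <T> Map<T, Integer> counts(List<T> items) {
        Map<T, Integer> result = new HashMap<>();
        for (T item : items) {
            result.merge(item, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("Token", "Sentence", "Token");
        List<String> actual = Arrays.asList("Sentence", "Token", "Token");
        System.out.println(sameElements(expected, actual));
    }
}
```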

md5 checksums:
{{94499c8f18f832fd1ded9106c64e8c76 *uima-core_obj-0.2.jar}}
{{0cac18e89c616a8270e810f34b6468ad *uimaj-core_obj-0.2.tar.gz}}
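
The checksum lines above follow the usual md5sum layout: hex digest, then the file name, with the leading asterisk marking binary mode. For anyone without md5sum at hand, a downloaded attachment can be checked with the JDK's MessageDigest, roughly as sketched below. Md5Check is a hypothetical helper, not something shipped with the attachments.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Check {
    /** MD5 of an in-memory byte array as lowercase hex. */
    static String md5Hex(byte[] data) {
        return toHex(newDigest().digest(data));
    }

    /** MD5 of a file as lowercase hex, read in chunks so large files are fine. */
    static String md5HexOfFile(String path) {
        MessageDigest digest = newDigest();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return toHex(digest.digest());
    }

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Usage: java Md5Check.java <file> <expected-md5>
    public static void main(String[] args) {
        String actual = md5HexOfFile(args[0]);
        System.out.println((actual.equalsIgnoreCase(args[1]) ? "OK " : "MISMATCH ") + actual);
    }
}
```

For example, run it against uima-core_obj-0.2.jar with the first digest listed above as the second argument.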


  was:
I have been experimenting with a simplified CAS implementation where each feature structure
is an object and the indices are based on standard Java SDK concurrent collection classes.
This replaces the complex custom array-based heaps and index implementations.

The primary motivation was to make the CAS threadsafe so that multiple annotators could process
one concurrently, but I think there are a number of other benefits.

Summary of advantages:
- Drastic simplification of code - most proprietary data structure impls removed, many other
classes removed, index/index repo impls are about 25% of the size of the heap versions (good
for future enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work on the same CAS concurrently:
reading, writing and iterating over feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing arrays, no large up-front
memory cost for CASes - pooling them should no longer be necessary
- Unlike the current heap impl, when a FS is removed from CAS indices its space is actually
freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists) "is" the feature structure
- Significantly better performance (speed) for many use-cases, especially where there is heavy
access of CAS data
- Usage of standard Java data structure classes means it can benefit more "for free" from
ongoing improvements in the Java SDK and from hardware optimizations targeted at these classes

I was hoping to see if there's interest from the community in taking this further, maybe even
as a replacement for the current impl in a future version of uima-core. There has already
been some discussion on the mailing list under the subject "Alternate CAS implementation".

I'm attaching the current prototype, which should support most existing UIMA functionality
with the exception of:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related function including CAS markers
- Index "auto protection" (recent 2.7 feature)

Note I don't mean to imply these things can't be supported, just that they aren't yet.

Where these things aren't used it should be possible to try out the attached uima-core.jar
as a drop-in replacement with existing apps/frameworks. An important caveat though is that
any existing JCas cover classes will need recompiling with the new jar (but not re-JCasGenning).

I'll also attach the code. I started by basically ripping out the CAS heaps, so there's a
lot of code which is just commented out (e.g. in CASImpl.java). Lots of cleanup/tidyup is
still needed, and there are various places which still need fixing for thread safety (e.g. synchronization
around some existing create-on-first-access logic; this is separate from the indices, though).
But those things shouldn't affect existing usage. A convention I followed was not to rename
modified classes (e.g. CASImpl), but where an equivalent impl was created from scratch I did
give it a new name starting with "CC" (e.g. FeatureStructureImpl is now CCFeatureStructure).
The cc stood for "concurrent CAS". I have kept it in sync with the latest compatible changes
in the uima-core stream, apart from those related to the non-impl'd functions mentioned above.

Most of the "valid" unit tests work. Some are tied to the internals and no longer apply; many
don't compile because they use binary serialization and/or delta CAS related classes which
I removed for the time being. Some others I had to generalize a bit because, for example, they
assumed a specific order in places where the order should be arbitrary, or for other
similar reasons.

md5 checksums:
{{4fd19b5f804fe8d505f697240c8e0366 *uima-core.jar}}
{{51826aa44111b7f6e1fa307393eda8f4 *uimaj-core_obj.tar.gz}}



> Object-based CAS implementation proposal/prototype
> --------------------------------------------------
>
>                 Key: UIMA-4329
>                 URL: https://issues.apache.org/jira/browse/UIMA-4329
>             Project: UIMA
>          Issue Type: Brainstorming
>          Components: Core Java Framework
>            Reporter: Nick Hill
>            Priority: Minor
>         Attachments: uima-core_obj-0.2.jar, uimaj-core_obj-0.2.tar.gz



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
