uima-dev mailing list archives

From Nick Hill <apa...@nickhill.org>
Subject Alternate CAS implementation
Date Wed, 01 Apr 2015 06:03:24 GMT

Hi all, I work with Marshall and Eddie and have been using UIMA for  
some time but am new to the mailing list.

As an experiment, I re-implemented the (Java) CAS internals such that
each feature structure corresponds to a single Java object instead of
using the custom "heaps" (monolithic arrays), and indices are built
from standard Java SDK (concurrent) collection classes.
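
To make the contrast concrete, here is a minimal sketch of the two
representations. Everything in it is hypothetical and for illustration
only (the class names, the three-slot heap layout, the sort order); it
is not the prototype's code, and it assumes a sorted index can be
modelled as a ConcurrentSkipListSet.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

// All names here are hypothetical; this is not the prototype's code.
final class FsObjectSketch {

    // Heap style: a feature structure is an int address into one shared array.
    // Assumed layout per feature structure: [typeCode, begin, end].
    static int heapBegin(int[] heap, int fsAddr) {
        return heap[fsAddr + 1];
    }

    // Object style: a feature structure is a plain Java object with typed fields.
    static final class Span {
        final int id;     // tie-breaker so distinct spans never compare as equal
        final int begin;
        final int end;

        Span(int id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }
    }

    // Illustrative sort order: begin, then end, then id.
    static final Comparator<Span> ORDER =
        Comparator.comparingInt((Span s) -> s.begin)
                  .thenComparingInt(s -> s.end)
                  .thenComparingInt(s -> s.id);

    // A sorted index backed by a standard concurrent collection.
    static ConcurrentSkipListSet<Span> newIndex() {
        return new ConcurrentSkipListSet<>(ORDER);
    }

    public static void main(String[] args) {
        int[] heap = {7, 10, 20};
        System.out.println("heap begin = " + heapBegin(heap, 0));

        ConcurrentSkipListSet<Span> index = newIndex();
        Span a = new Span(1, 10, 20);
        Span b = new Span(2, 0, 5);
        index.add(a);
        index.add(b);
        System.out.println("first begin = " + index.first().begin);

        // After removal, b is ordinary garbage once no other reference remains.
        index.remove(b);
        System.out.println("size after remove = " + index.size());
    }
}
```

In the object style there is no address arithmetic and no backing
array to resize; the collection and the garbage collector do that work.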

The original motivation was to make the CAS threadsafe but I think  
there are other benefits, the biggest of which may be  
reduction/simplification of the codebase.

This new impl should be fully compatible with all of the existing CAS  
APIs, with a few exceptions (see below). i.e. in most cases it can be  
a drop-in replacement for uima-core.jar. Existing JCas cover classes  
can be used but must be recompiled. I also included a "compatibility  
layer" for the low level CAS API so that existing usage of it should  
still work, but removing the heaps of course obviates the need for it.

Summary of advantages:
- Drastic simplification of code - most proprietary data structure  
impls removed, many other classes removed, index/index repo impls are  
about 25% of the size of the heap versions (good for future  
enhancements/maintainability)
- Thread safety - multiple logically independent annotators can work  
on the same CAS concurrently - reading, writing and iterating over  
feature structures. Opens up a lot of parallelism possibilities
- No need for heap resizing or wasted space in fixed size CAS backing  
arrays, no large up-front memory cost for CASes - pooling them should  
no longer be necessary
- Unlike the current heap impl, when a FS is removed from CAS indices
its space is actually freed (can be GC'd)
- Unification of CAS and JCas - cover class instance (if it exists)  
"is" the feature structure
- Significantly better performance (speed) for many use-cases,  
especially where there is heavy access of CAS data
- Usage of standard Java data structure classes means it can benefit
more "for free" from ongoing improvements in the Java SDK and from
hardware optimizations targeted at these classes
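
As a rough illustration of the thread-safety point, the sketch below
has two writers add to a shared index while a third task iterates over
it. It assumes the index behaves like a ConcurrentSkipListSet, and
plain integers stand in for feature structures; the class and method
names are made up for the example and are not from the prototype.

```java
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: integers stand in for feature structures.
final class ConcurrentIndexSketch {

    // Runs two writers and one reader against a shared index and
    // returns the final index size.
    static int runAnnotators(int perWriter) {
        ConcurrentSkipListSet<Integer> index = new ConcurrentSkipListSet<>();
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            // Two "annotators" adding disjoint entries concurrently.
            Future<?> evens = pool.submit(() -> {
                for (int i = 0; i < perWriter; i++) {
                    index.add(2 * i);
                }
            });
            Future<?> odds = pool.submit(() -> {
                for (int i = 0; i < perWriter; i++) {
                    index.add(2 * i + 1);
                }
            });
            // A third task iterates while the writers may still be running.
            // The iterator is weakly consistent: it never throws
            // ConcurrentModificationException, but how many entries it
            // sees depends on timing.
            Future<Integer> reader = pool.submit(() -> {
                int seen = 0;
                for (Integer ignored : index) {
                    seen++;
                }
                return seen;
            });
            evens.get();
            odds.get();
            reader.get();
            return index.size();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("final size = " + runAnnotators(1000));
    }
}
```

Once both writers have finished, the index holds every entry from both
of them; only the reader's count varies from run to run.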


Functionality not yet supported:
- Binary serialization/deserialization
- C/C++ framework (requires binary serialization)
- "Delta" CAS related function including CAS markers
- Index "auto protection" (recent 2.7 feature)

- Snapshot iterators currently return regular iterators (but all  
iterators are safe to use concurrently with modification)
- Multiple classloaders haven't been tested
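
For the snapshot iterator item, one possible approach (an assumption on
my part, not something the prototype does today) would be to copy the
index contents before iterating. The sketch below uses hypothetical
names and a ConcurrentSkipListSet of integers as a stand-in index.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch of a copy-based snapshot iterator.
final class SnapshotIteratorSketch {

    // Copies the current contents, so later changes to the index are not
    // visible through the returned iterator. If other threads modify the
    // index during the copy, the copy itself is only weakly consistent.
    static <T> Iterator<T> snapshotIterator(ConcurrentSkipListSet<T> index) {
        List<T> copy = new ArrayList<>(index);
        return copy.iterator();
    }

    // Counts the remaining elements of an iterator.
    static <T> int count(Iterator<T> it) {
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> index = new ConcurrentSkipListSet<>();
        index.add(1);
        index.add(2);
        index.add(3);

        Iterator<Integer> snapshot = snapshotIterator(index);
        index.add(4); // added after the snapshot was taken

        System.out.println("snapshot sees " + count(snapshot) + " entries");
        System.out.println("index holds " + index.size() + " entries");
    }
}
```

The cost is one copy per iterator, which is why a regular (weakly
consistent) iterator is the cheaper default.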

There are also various other small loose ends and some cleanup to do.


I was hoping to see if there's interest from the community in taking  
this further, maybe even as a replacement for the current impl in a  
future version of uima-core.

I'm not sure of the best way to share the code, but it would be great  
to have a branch in the shared SCM repo where the current prototype  
could be reviewed and collaboratively evolved to fill the remaining  
gaps.

Would welcome any comments or questions!

Thanks,
Nick

