uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject UIMA v3 JCas design
Date Mon, 14 Sep 2015 15:55:12 GMT
The JCas in v2 provides Java-friendly access, by named features, to UIMA
Feature Structure data (which is stored independently of the JCas class).

In v3, we want to keep this aspect, and also move the storage of Feature
Structure data into the JCas object instance.  This allows the data to be
Garbage Collected along with the object, and may allow for more locality of
reference (resulting in improved performance).

There are two ways this could be done.  One (pioneered by Nick Hill's
contribution) is to have two generic objects added to the JCas class - one
holding an array of references to other Java objects, and the other holding an
array of ints.  The latter is used for primitive data, the former for data which
is represented by other Feature Structures (JCas instances), or Strings, etc.
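
A minimal sketch of this first approach, to make the indirection concrete.
All names here (SketchType, SketchFeatureStructure, the slot methods) are
hypothetical and are not taken from Nick Hill's contribution or from any UIMA
API; the sketch also assumes the type is fully defined before any instance is
created, and it only handles primitives that fit in a single int slot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a committed type: maps each feature name to a
// slot in one of the two generic arrays.
final class SketchType {
    private final Map<String, Integer> intSlots = new HashMap<>();
    private final Map<String, Integer> refSlots = new HashMap<>();

    void addPrimitiveFeature(String name) { intSlots.putIfAbsent(name, intSlots.size()); }
    void addReferenceFeature(String name) { refSlots.putIfAbsent(name, refSlots.size()); }

    int intSlotCount() { return intSlots.size(); }
    int refSlotCount() { return refSlots.size(); }

    int intSlot(String name) { return lookup(intSlots, name); }
    int refSlot(String name) { return lookup(refSlots, name); }

    private static int lookup(Map<String, Integer> slots, String name) {
        Integer slot = slots.get(name);
        if (slot == null) {
            throw new IllegalArgumentException("unknown feature: " + name);
        }
        return slot;
    }
}

// Hypothetical feature structure holding its own data in two generic arrays.
class SketchFeatureStructure {
    private final SketchType type;
    private final int[] ints;      // primitive data
    private final Object[] refs;   // other feature structures, Strings, etc.

    SketchFeatureStructure(SketchType type) {
        this.type = type;
        this.ints = new int[type.intSlotCount()];
        this.refs = new Object[type.refSlotCount()];
    }

    // Every access goes through the type to find the slot (the indirection).
    int getInt(String feature) { return ints[type.intSlot(feature)]; }
    void setInt(String feature, int value) { ints[type.intSlot(feature)] = value; }

    // A float is stored as its bit pattern in an int slot.
    float getFloat(String feature) {
        return Float.intBitsToFloat(ints[type.intSlot(feature)]);
    }
    void setFloat(String feature, float value) {
        ints[type.intSlot(feature)] = Float.floatToRawIntBits(value);
    }

    Object getRef(String feature) { return refs[type.refSlot(feature)]; }
    void setRef(String feature, Object value) { refs[type.refSlot(feature)] = value; }
}
```

Because the class shape never depends on the feature names, the same two
classes can serve any type system; the cost is the slot lookup on each access.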

The other way this could be done is to have Java fields that directly
correspond to the range type of the feature.  So for a feature "foo" whose range
was "double", we could define a field:
  private double _foo;
With this approach, the getters / setters become trivial, without any
indirection via the type system.
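
As a rough illustration of how trivial the accessors become, here is a sketch
of what such a class might look like.  The class name FooHolderSketch is
hypothetical; real generated JCas classes would carry more than this.

```java
// Hypothetical sketch of a class with one Java field per feature.
class FooHolderSketch {
    // Field type matches the feature's range type ("double").
    private double _foo;

    // Plain field access: no lookup through the type system.
    public double getFoo() { return _foo; }
    public void setFoo(double value) { _foo = value; }
}
```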

Experiments with this approach show it is feasible: the Java code needed,
including the field definitions, can be generated after the type system is
committed.

But there is a severe issue with this, which is that once loaded, Java Class
definitions cannot be changed easily. There are some workarounds, most of which
involve using multiple class loaders.  As far as I've been able to determine,
this requires changes to user code to introduce new levels of class loaders
(perhaps multiples), which we would like to avoid (or at least make "automatic"
in some sense).

A particularly interesting use case to consider is one where a UIMA application
is written without using the JCas in v2, which runs a loop deserializing a type
system and a CAS that has that type system, and processing it (perhaps referring
to common "built-in" types, or other special types that it determines
dynamically via reading other values in the CAS).   In the case of generated
JCas storage which had separate fields for each Feature, the generation would
need to be done for each type system.  Because definitions cannot be replaced,
the new ones would need to be loaded under a fresh class loader.  And, in case
the user code made use of the JCas, that code would also need to be loaded under
that same fresh class loader so it would "see" the new JCas classes. This means
it would undergo any JIT-ing etc. once again.
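
The class identity issue can be seen in a small experiment.  This is only a
sketch: GeneratedTokenSketch stands in for a generated JCas class,
FreshLoaderSketch is a hypothetical loader that defines its own copy of that
one class, and the code assumes the compiled class bytes are available as a
resource from the parent loader (as they are when running from a normal
classpath).

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a generated JCas class.
class GeneratedTokenSketch {
    private double _foo;
    public double getFoo() { return _foo; }
    public void setFoo(double value) { _foo = value; }
}

// Defines its own copy of one target class; delegates everything else.
class FreshLoaderSketch extends ClassLoader {
    private final String target;

    FreshLoaderSketch(ClassLoader parent, String target) {
        super(parent);
        this.target = target;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        if (!name.equals(target)) {
            return super.loadClass(name, resolve);
        }
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                byte[] bytes = readClassBytes(name);
                c = defineClass(name, bytes, 0, bytes.length);
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    // Reads the compiled bytes of the class from the parent's resources.
    private byte[] readClassBytes(String name) throws ClassNotFoundException {
        String resource = name.replace('.', '/') + ".class";
        try (InputStream in = getParent().getResourceAsStream(resource)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            return in.readAllBytes();
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}

class ClassLoaderIdentityDemo {
    // Loads a fresh copy of GeneratedTokenSketch under a new class loader.
    static Class<?> loadFresh() {
        String name = GeneratedTokenSketch.class.getName();
        ClassLoader parent = GeneratedTokenSketch.class.getClassLoader();
        try {
            return new FreshLoaderSketch(parent, name).loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> a = loadFresh();
        Class<?> b = loadFresh();
        System.out.println("same name:  " + a.getName().equals(b.getName()));
        System.out.println("same class: " + (a == b));
    }
}
```

The two loads yield classes with the same name but distinct identities, so
code compiled against one copy cannot use instances of the other.  That is why
user code referring to the JCas classes would have to be loaded under the same
fresh class loader.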

Because of these issues with class loading in the approach where the storage is
done "directly", I'm now thinking that having the data be stored more
generically and accessed via the type system, despite the level of indirection
this introduces, is perhaps a better way to go.  It may even be possible to
separate the data storage part of the JCas from the JCas part, allowing a
smoother transition to v3 (yet to be investigated).

-Marshall 
