uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Design choices for changing type systems with loaded JCas classes [was Re: UIMAv3 & WebAnno]
Date Mon, 08 Jan 2018 20:14:10 GMT

On 1/8/2018 1:31 PM, Richard Eckart de Castilho wrote:
> On 08.01.2018, at 16:16, Marshall Schor <msa@schor.com> wrote:
>> After a lot of thought, here's a proposal, along the lines Richard suggests:
>> The basic idea is to have the JCas classes, if they exist for some type, augment
>> that type with features defined only in the JCas class.
>> This augmentation would be done at type system commit time, and would really
>> modify the type system being committed to have the extra features.  Because the
>> type system would be modified to include these extra features, the Feature
>> Structures made with these "augmented" types would be larger (because they would
>> have slots for these features).  This ensures that subtypes' features won't
>> overlap / collide with the expanded features.
>> I'll work out the details, and see if I can make this change.
> After some thought, I believe the problem with the availability and ordering of
> features can be sidestepped if we consider the JCas classes as a canonical source
> for type system definitions.

I'm not sure we need to say which is the canonical source and which is the
augmenting source.  In the proposal, both are used. 

Note that in both V2 and V3, JCas class definitions are optional.
In V3, the "built-in" ones are always present, and used. 

It is perfectly OK for JCas types *not* to exist that correspond to user-defined
types.  This is often the case where the type system is not knowable at "compile
time", for example, for general purpose annotators designed to "discover" at run
time the type system in use.

> JCas classes represent a pretty strong and rigid contract on the type system and
> there can only be one set of them available through a particular classloader at any given
> time. XML TSDs, on the other hand, are comparatively flexible and a dime a dozen. Arbitrary
> numbers of them can be merged and used to initialize a CAS.
> So my suggestion would be: when using the JCas API, then JCas classes are treated
> as the canonical source for the type system definition. 

I believe that, to make things work, both the type system definition and the JCas
definition(s) need to be used.  I'm missing what the "canonical" part does.  It
might be something that gives "priority" to one of two definitions that
conflict, but the current code instead treats that as an error which needs
to be resolved (e.g., you can't have a feature with a range of "uima.cas.String"
in the type definition, and that same feature having a range of
"uima.cas.Integer" in the JCas).

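The "use both, and require agreement" rule described above can be sketched in
plain Java.  This is only an illustrative model, not UIMA code: the class
FeatureMergeSketch, the method augment, and RangeConflictException are
hypothetical names, and a type is reduced to an ordered map from feature name
to range name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the merge rule; none of these names are UIMA API. */
final class FeatureMergeSketch {

    /** Signals that the two sources give one feature two different ranges. */
    static final class RangeConflictException extends RuntimeException {
        RangeConflictException(String message) {
            super(message);
        }
    }

    /**
     * Merges "feature name -> range name" maps from both sources.
     * Features found only in the JCas class are appended after the declared ones;
     * a feature present in both must have the same range, otherwise it is an error.
     */
    static Map<String, String> augment(Map<String, String> declared,
                                       Map<String, String> fromJCas) {
        Map<String, String> merged = new LinkedHashMap<>(declared);
        for (Map.Entry<String, String> e : fromJCas.entrySet()) {
            // putIfAbsent returns the range already recorded, or null if it added one.
            String existing = merged.putIfAbsent(e.getKey(), e.getValue());
            if (existing != null && !existing.equals(e.getValue())) {
                throw new RangeConflictException("Feature '" + e.getKey()
                        + "' has range " + existing + " in the type system but "
                        + e.getValue() + " in the JCas class");
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("lemma", "uima.cas.String");

        Map<String, String> fromJCas = new LinkedHashMap<>();
        fromJCas.put("lemma", "uima.cas.String");   // same in both: accepted
        fromJCas.put("score", "uima.cas.Integer");  // JCas only: added to the type

        System.out.println(augment(declared, fromJCas));
    }
}
```

In this model neither source is "first": the result is the same set of features
whichever map is treated as the starting point, and only a disagreement about a
range is rejected.
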
I think many users use a combination of JCas APIs and pure CAS APIs.  They use
the JCas APIs for common things like annotation begin/end, but write general
purpose annotators that work with arbitrary subtypes of these, where the type is
unknown at compile time, and therefore cannot have a custom JCas class
definition.

> They define which types
> exist, which parent types they have, and what is the order of the features. If
> a user provides additional TSDs when initializing a CAS, then these are merged
> on top of the definitions sourced from the JCas classes. In this way, features
> defined in JCas classes can never be missing and they always have a defined order,
> irrespective of the presence of any other TSDs. 

See my other note mentioning issues around Pears: essentially, multiple class
loaders per pipeline.

> If any additional features are
> defined in TSDs, then they need to be accessed through the CAS API anyway. I believe
> there would also be no issues with subtypes in this "JCas first" scenario.

I'm not seeing any difference between "JCas first" and "consider both".

> This approach would also prevent an error from being triggered when accessing
> features defined in JCas but not in an XML TSD, since the features are defined
> via their presence in the JCas class.

I think this suggestion is the same as what I was proposing (except for calling
out one of the sources as "first").  I don't think it matters which is "first" -
the type system description or the JCas version.  The proposal uses both of
these, and if a feature is defined in both, it is required to be the same.

> A potential downside is that users who initialize a CAS with a small XML TSD but
> who have rich JCas classes on the classpath might end up with more memory usage
> than they asked for - I assume that would rarely happen.

I agree.  I mentioned that in the "proposal".

> This could be mitigated
> by only initializing JCas classes if their types are actually defined in the
> user-provided TSD at initialization time. 

Good point.  It is often true by default: a JCas class that is not referenced in
loaded code won't be loaded if there's no type definition corresponding to it.
This would happen in the implementation anyway, because the code that triggers
the JCas loading and augmentation of the type system is type-system-commit,
which iterates over defined types.  JCas classes not having corresponding type
definitions are never loaded.

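A rough sketch of why iterating over defined types at commit keeps unrelated
JCas classes out of the picture.  Everything here is an assumption made for
illustration: LazyCommitSketch, register, commit and the "example." type names
are hypothetical, and a Supplier stands in for loading a JCas class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical model of commit-driven JCas loading; not UIMA API. */
final class LazyCommitSketch {

    // Type name -> deferred "class load" yielding the features the JCas class declares.
    private final Map<String, Supplier<List<String>>> available = new LinkedHashMap<>();

    // Records which JCas stand-ins were actually loaded, in load order.
    final List<String> loaded = new ArrayList<>();

    /** Makes a JCas stand-in available without loading it. */
    void register(String typeName, List<String> jcasFeatures) {
        available.put(typeName, () -> {
            loaded.add(typeName);
            return jcasFeatures;
        });
    }

    /** Iterates over the defined types only, augmenting each one that has a JCas stand-in. */
    Map<String, List<String>> commit(Map<String, List<String>> definedTypes) {
        Map<String, List<String>> committed = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> type : definedTypes.entrySet()) {
            List<String> features = new ArrayList<>(type.getValue());
            Supplier<List<String>> loader = available.get(type.getKey());
            if (loader != null) {
                for (String feature : loader.get()) {
                    if (!features.contains(feature)) {
                        features.add(feature); // extra slot contributed by the JCas class
                    }
                }
            }
            committed.put(type.getKey(), features);
        }
        return committed;
    }

    public static void main(String[] args) {
        LazyCommitSketch sketch = new LazyCommitSketch();
        sketch.register("example.Token", Arrays.asList("pos"));
        sketch.register("example.Sentence", Arrays.asList("id"));

        Map<String, List<String>> defined = new LinkedHashMap<>();
        defined.put("example.Token", Arrays.asList("begin", "end"));

        System.out.println(sketch.commit(defined));
        System.out.println("loaded: " + sketch.loaded);
    }
}
```

Because the loop is driven by the defined types, the stand-in for
example.Sentence is never touched in this model, so it contributes no extra
feature slots.
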
> Finally, users who really do not want
> to have any JCas classes affect their CASes could maybe entirely disable JCas
> for a given CAS instance - I thought years ago, I had seen an option somewhere
> to do that, but I don't find it at the moment.

There was such an option in v2; this option blocked the map from v2 native FS
(by ref) to corresponding JCas instances.
In v3, all instances are instances of some JCas class, so that doesn't apply.
I agree that some disabling option could be added (to prevent "expansion" of
extra feature slots, if not wanted).

> What do you think?

I think you ended up with the same proposal, so I agree with it :-) .  The only
remaining trouble is with Pears... to be figured out.

Thanks for thinking about this! -Marshall

> Cheers,
> -- Richard
