uima-dev mailing list archives

From Richard Eckart de Castilho <richard.eck...@gmail.com>
Subject Re: [jira] [Commented] (UIMA-3141) Binary CAS format 6 + type filtering fails to deserialize document annotation correctly
Date Mon, 05 Aug 2013 08:25:57 GMT
>> Richard Eckart de Castilho commented on UIMA-3141:
>> --------------------------------------------------
>> If the custom sub-type of DocumentAnnotation is part of the target type system, it
>> works (not verified in exactly the given test case, but in the context from which this test
>> was distilled).
>> Since the document annotation is a special annotation in UIMA, it may require special
>> handling. I would expect that all features are set if they are available on the document annotation,
>> even if the type of the document annotation is not the same.
> I'm not following... All features of DocMeta are set.  It's the entire type
> instance of DocMeta that's being "filtered out" when deserializing.
> I'm probably not understanding your point correctly though - please say more.

I think my point is this: if the type T for a given FS is not available in the target type
system, but a type S which is the supertype of T in the source type system is available in
the target type system, then an instance of S should be created and all features should be
set.

Now, stating it like this, it becomes obvious that this is probably not the best idea in general.
Eventually nothing would be filtered, because everything inherits from TOP. But I'll still
go on and explain what led me to believe this would be a good idea, at least in certain cases.
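
To make the fallback rule concrete, here is a minimal sketch in plain Java. It does not use the UIMA API; the type systems are modeled as a child-to-parent map and a set of names. The `custom.` prefix and the `custom.Unrelated` type are made up for illustration, and the built-in type names follow the standard UIMA hierarchy.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SupertypeFallback {

    /** Direct-supertype map of a hypothetical source type system (child -> parent). */
    public static Map<String, String> exampleParents() {
        Map<String, String> parents = new HashMap<>();
        parents.put("custom.DocMeta", "uima.tcas.DocumentAnnotation");
        parents.put("uima.tcas.DocumentAnnotation", "uima.tcas.Annotation");
        parents.put("uima.tcas.Annotation", "uima.cas.AnnotationBase");
        parents.put("uima.cas.AnnotationBase", "uima.cas.TOP");
        parents.put("custom.Unrelated", "uima.cas.TOP");
        return parents;
    }

    /**
     * Returns typeName itself if the target type system knows it, otherwise
     * the nearest supertype the target knows, or null if no ancestor is known.
     */
    public static String resolve(String typeName,
                                 Map<String, String> sourceParents,
                                 Set<String> targetTypes) {
        String current = typeName;
        while (current != null) {
            if (targetTypes.contains(current)) {
                return current;
            }
            // Step to the direct supertype; null once we move past the root.
            current = sourceParents.get(current);
        }
        return null;
    }

    public static void main(String[] args) {
        // Target type system that only contains the built-in types.
        Set<String> target = new HashSet<>(Arrays.asList(
                "uima.cas.TOP", "uima.cas.AnnotationBase",
                "uima.tcas.Annotation", "uima.tcas.DocumentAnnotation"));
        Map<String, String> parents = exampleParents();

        // The custom document annotation falls back to the default one.
        System.out.println(resolve("custom.DocMeta", parents, target));
        // Any other missing type falls back all the way to TOP.
        System.out.println(resolve("custom.Unrelated", parents, target));
    }
}
```

The second call shows the problem with applying the rule everywhere: as long as TOP is in the target type system, `resolve` never returns null, so no instance would ever be filtered out.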

Case 1: custom document annotation

First off, this point is moot when the document annotation type is customized as described
in the UIMA documentation [1]. However, not everybody follows that documentation. E.g. Ruta
and DKPro Core instead customize the document annotation type by deriving from it.

The document annotation is quite special. There are methods in the CAS interface (e.g. getDocumentLanguage())
which internally access the document annotation, but this is not obvious. It appears that
the language is just a property of the CAS itself. 

When loading data from a binary CAS with a customized document annotation type into a target
CAS with another document annotation type (either custom or default), one would expect that
such general information as the document language should be preserved. It is basically mandatory
that the language feature exists in any kind of document annotation, since it is blessed with
its own dedicated getter/setter methods in the CAS interfaces.

Case 2: tags as types

Several type systems model tags/categories as types. A typical type hierarchy would e.g. contain
a type PartOfSpeech with sub-types such as Noun, Verb, etc. (often categories from a specific tag
set are used). The PartOfSpeech type tends to also have a feature holding the tag value, e.g.
"tag" which assumes values such as "NN", "NNP", etc. (generally from a specific tag set, even
if the sub-types mentioned before may be more coarse-grained).

Assume one is serializing a CAS containing such tag sub-types, e.g. in an annotation editor.
Now the user reconfigures the type system, e.g. switching from coarse-grained tag types ("Noun")
to fine-grained tag types ("NN", "NNP", etc.). Then the user loads the data back. Currently,
all the annotations of type "Noun" would be lost, because the "Noun" type does not exist anymore.
It would be useful if they had just been downgraded to "PartOfSpeech" annotations, which now
could be upgraded to the new "NN", "NNP" types.
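
A rough sketch of that downgrade/upgrade round trip, again in plain Java rather than against the UIMA API. The `Pos` class is a hypothetical stand-in for an annotation, and the sketch assumes that PartOfSpeech exists in the target type system and that the new fine-grained type names equal the tag values:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TagTypeMigration {

    /** Hypothetical, simplified annotation: type name, offsets, and the "tag" feature. */
    public static final class Pos {
        public final String type;
        public final int begin;
        public final int end;
        public final String tag;

        public Pos(String type, int begin, int end, String tag) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.tag = tag;
        }
    }

    /** Common supertype, assumed to be present in the target type system. */
    public static final String BASE_TYPE = "PartOfSpeech";

    /** Keeps the annotation as-is if its type still exists, else re-types it as PartOfSpeech. */
    public static Pos downgrade(Pos a, Set<String> targetTypes) {
        if (targetTypes.contains(a.type)) {
            return a;
        }
        // Offsets and the tag value are defined on the supertype, so they survive.
        return new Pos(BASE_TYPE, a.begin, a.end, a.tag);
    }

    /** Re-types a plain PartOfSpeech annotation to the type named by its tag, if one exists. */
    public static Pos upgrade(Pos a, Set<String> targetTypes) {
        boolean tagNamesType = a.tag != null && targetTypes.contains(a.tag);
        if (BASE_TYPE.equals(a.type) && tagNamesType) {
            return new Pos(a.tag, a.begin, a.end, a.tag);
        }
        return a;
    }

    public static void main(String[] args) {
        // New, fine-grained type system: "Noun" is gone, "NN" and "NNP" were added.
        Set<String> target = new HashSet<>(Arrays.asList("PartOfSpeech", "NN", "NNP"));

        Pos stored = new Pos("Noun", 0, 3, "NN");
        Pos loaded = downgrade(stored, target);
        Pos migrated = upgrade(loaded, target);
        System.out.println(stored.type + " -> " + loaded.type + " -> " + migrated.type);
    }
}
```

The upgrade step is only possible because the downgrade kept the annotation and its tag value instead of dropping it.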

As mentioned before, generally falling back to super-types is an obviously bad idea, even
though there may be use cases where this can help (case 2). However, I still think that specially
blessed information, such as the document language, should be preserved, even if the document annotation
type is changed (case 1). 


-- Richard

[1] http://uima.apache.org/d/uimaj-2.4.1/references.html#ugr.ref.jcas.documentannotation_issues