uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: [jira] [Commented] (UIMA-3141) Binary CAS format 6 + type filtering fails to deserialize document annotation correctly
Date Mon, 05 Aug 2013 11:55:30 GMT
Thanks for expanding on this issue; see some comments below.

On 8/5/2013 4:25 AM, Richard Eckart de Castilho wrote:
>>> Richard Eckart de Castilho commented on UIMA-3141:
>>> --------------------------------------------------
>>> If the custom sub-type of DocumentAnnotation is part of the target type system,
>>> it works (not verified in exactly the given test case, but in the context from which
>>> this test was distilled).
>>> Since the document annotation is a special annotation in UIMA, it may require
>>> special handling. I would expect that all features are set if they are available on the
>>> document annotation, even if the type of the document annotation is not the same.
>> I'm not following... All features of DocMeta are set.  It's the entire type
>> instance of DocMeta that's being "filtered out" when deserializing.
>> I'm probably not understanding your point correctly though - please say more.
> I think my point is this: if the type T for a given FS is not available in the target
> type system, but a type S which is the supertype of T in the source type system is available
> in the target type system, then an instance of S should be created and all features should
> be set.
> Now, stating it like this, it becomes obvious that this is probably not the best idea
> in general. Eventually nothing would be filtered, because everything inherits from TOP. But
> I'll still go on and explain what led me to believe this would be a good idea, at least in
> certain cases.
> Case 1: custom document annotation
> First off, this point is moot when the document annotation type is customized as described
> in the UIMA documentation [1]. However, not everybody follows that documentation. E.g. Ruta
> and DKPro Core instead customize the document annotation type by deriving from it.
This use was a surprise to me, and I wonder about the utility of it, as compared
to extending the DocumentAnnotation by adding more features to it.  I'm
wondering why the original designers of UIMA didn't declare this type to be a
type which could not be inherited from.
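For reference, the approach described in the documentation [1] redefines uima.tcas.DocumentAnnotation itself, adding features to it rather than deriving a subtype. A rough descriptor sketch (the extra feature name here is made up for illustration):

```xml
<typeDescription>
  <!-- Redefine the built-in type; do not subclass it -->
  <name>uima.tcas.DocumentAnnotation</name>
  <description>Document annotation extended with extra metadata.</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <!-- hypothetical extra feature -->
      <name>collectionId</name>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```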
> The document annotation is quite special. There are methods in the CAS interface (e.g.
> getDocumentLanguage()) which internally access the document annotation, but this is not obvious.
> It appears that the language is just a property of the CAS itself.
I agree this design is a bit unusual, and I don't know the reason it was done
this way, other than I know there was a desire to keep UIMA independent of the
actual kind of unstructured information being processed, and the designers were
aware that not all unstructured data was "text" (think of audio, video, etc). 
So my guess of the motivation behind this is that "language" was not part of the
CAS, but rather part of the DocumentAnnotation, which was specific to "text". 
But for convenience, the set/get methods were added to the CAS interface.
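A toy model of that delegation may make the shape clearer (this is not the real uimaj-core code; the class names here are invented for illustration):

```java
// Toy sketch: the CAS-level language accessors delegate to the
// document annotation FS, where the value actually lives.
class ToyDocumentAnnotation {
    String language = "x-unspecified"; // UIMA's default language value
}

class ToyCas {
    private final ToyDocumentAnnotation docAnnot = new ToyDocumentAnnotation();

    // Convenience methods exposed on the CAS interface...
    String getDocumentLanguage() { return docAnnot.language; }
    void setDocumentLanguage(String lang) { docAnnot.language = lang; }

    // ...but the feature really belongs to the document annotation.
    ToyDocumentAnnotation getDocumentAnnotation() { return docAnnot; }
}
```

In this sketch, setDocumentLanguage("en") and writing the language feature on getDocumentAnnotation() are the same operation, which is why filtering out the document annotation type silently loses what looks like CAS-level state.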

> When loading data from a binary CAS with a customized document annotation type into a
> target CAS with another document annotation type (either custom or default), one would expect
> that such general information as the document language should be preserved. It is basically
> mandatory that the language feature exists in any kind of document annotation, since it is
> blessed with its own dedicated getter/setter methods in the CAS interfaces.
So, I suppose we could special-case this feature. But it's not clear in the
general case how to design this.  The general case might include situations
where users declared multiple subtypes of DocumentAnnotation, or even
subtypes of subtypes (in a supertype chain), and set some of their "language"
features to several different values.  Some subset of these might be "filtered",
but others might still exist.

I think this is a surprising thing for users to do; however, I was surprised
that users made subtypes of DocumentAnnotation.  And I wonder if the better
solution is to deprecate making subtypes of DocumentAnnotation, rather than
trying to find a way to handle these kinds of cases.

> Case 2: tags as types
> Several type systems model tags/categories as types. A typical type hierarchy would e.g.
> contain a type PartOfSpeech and a sub-type Noun, Verb, etc. (often categories from a specific
> tag set are used). The PartOfSpeech type tends to also have a feature holding the tag value,
> e.g. "tag" which assumes values such as "NN", "NNP", etc. (generally from a specific tag set,
> even if the sub-types mentioned before may be more coarse-grained).
> Assume one is serializing a CAS containing such tag sub-types, e.g. in an annotation
> editor. Now the user reconfigures the type system, e.g. switching from coarse-grained tag
> types ("Noun") to fine-grained tag types ("NN", "NNP", etc.). Then the user loads the data
> back. Currently, all the annotations of type "Noun" would be lost, because the "Noun" type
> does not exist anymore. It would be useful if they had just been downgraded to "PartOfSpeech"
> annotations, which now could be upgraded to the new "NN", "NNP" types.
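Concretely, the downgrading being proposed amounts to walking up the source supertype chain until a type present in the target type system is found. A self-contained sketch (the type names and hierarchy are illustrative, not from any real type system, and this is not the form 6 implementation):

```java
import java.util.Map;
import java.util.Set;

// Sketch of "downgrade to nearest surviving supertype" type mapping.
class TypeFallback {
    // child -> parent in the source type system (illustrative hierarchy)
    static final Map<String, String> SUPERTYPE = Map.of(
            "Noun", "PartOfSpeech",
            "Verb", "PartOfSpeech",
            "PartOfSpeech", "Annotation");

    // Returns the nearest type (self or ancestor) present in the target
    // type system, or null if the entire chain was filtered out.
    static String mapType(String sourceType, Set<String> targetTypes) {
        for (String t = sourceType; t != null; t = SUPERTYPE.get(t)) {
            if (targetTypes.contains(t)) {
                return t;
            }
        }
        return null;
    }
}
```

With a target type system containing only PartOfSpeech, mapType("Noun", ...) yields "PartOfSpeech". The sketch also makes the earlier objection visible: since every chain eventually reaches a root type (Annotation, and ultimately TOP), an unrestricted walk would filter nothing.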
I wonder if supporting this kind of up-classing is sufficiently useful and
general to be part of the form 6 serialization / deserialization.  I can imagine
many other kinds of type system "conversions" that users might want. 

The general topic of type system conversion is a complex one.  I think the more
complex forms of type conversion are an orthogonal topic to compressed binary
serialization, and probably don't belong in form 6
serialization/deserialization, which I think should be limited to the simpler
type and feature filtering that the other serialization/deserialization
forms also do when their "lenient" modes are used.  (CasCopier has a
lenient mode as well.)

> As mentioned before, generally falling back to super-types is an obviously bad idea,
> even though there may be use cases where this can help (case 2). However, I still think that
> specially blessed information, such as document language should be preserved, even if the
> document annotation type is changed (case 1).
Is this a real, frequently occurring situation?  Why wouldn't one include the
DocMeta type in the target type system?  I think that in the general case
(where users could design an arbitrary tree of subtypes of DocumentAnnotation,
instantiate one or more of these types, and then filter one or more of them),
there is no obvious design for how to "pick" the right language setting, how to
promote it, or whether it needs promoting at all.  I think this whole
area can easily go beyond the design intent of UIMA (which was to encourage
interoperability and sharing in a growing community of people working in
unstructured analysis), and that the better solution is to gradually enforce the
simpler approach by deprecating type definitions that subtype
DocumentAnnotation, unless of course there are valid use-cases for doing this
(which I'm unaware of at the moment :-) ).

