uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: [jira] [Commented] (UIMA-3141) Binary CAS format 6 + type filtering fails to deserialize document annotation correctly
Date Mon, 05 Aug 2013 12:54:54 GMT
One other thought experiment I use when considering new design extensions is this:

Is this new thing (much) more likely to be a user error vs a user intent?

In this case, the question would be: when a user defines a subtype of
DocumentAnnotation and then "filters" it out during serialization/
deserialization, is this (much) more likely to be a user error, where the user
would benefit from a warning or error message about this, versus something that
is (much) more likely a popular use case?

If it is much more likely a user error, we could have UIMA detect this and issue
a warning/error message.
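
To illustrate the kind of check meant here, the following is a minimal sketch in plain Java. It does not use the UIMA API: the type hierarchy is modelled as a simple child-to-parent map, and the class name, method names, and the type name `example.DocMeta` are assumptions made for the example. Only `uima.tcas.DocumentAnnotation` and `uima.tcas.Annotation` are real built-in type names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model (not UIMA API): finds subtypes of DocumentAnnotation that exist in
// the source type system but are missing from the target, i.e. would be filtered.
final class FilteredDocAnnotationCheck {
    static final String DOC_ANNOTATION = "uima.tcas.DocumentAnnotation";

    // True if `type` is a strict subtype of DocumentAnnotation.
    // Assumes the child -> parent map is acyclic.
    static boolean isDocAnnotationSubtype(String type, Map<String, String> parentOf) {
        String current = parentOf.get(type);
        while (current != null) {
            if (current.equals(DOC_ANNOTATION)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    // Sorted names of DocumentAnnotation subtypes that the target type system lacks.
    static List<String> filteredDocAnnotationSubtypes(Map<String, String> parentOf,
                                                      Set<String> targetTypes) {
        List<String> result = new ArrayList<>();
        for (String type : parentOf.keySet()) {
            if (!targetTypes.contains(type) && isDocAnnotationSubtype(type, parentOf)) {
                result.add(type);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = Map.of(
                "example.DocMeta", DOC_ANNOTATION,          // hypothetical custom subtype
                DOC_ANNOTATION, "uima.tcas.Annotation",
                "example.Token", "uima.tcas.Annotation");   // hypothetical unrelated type
        Set<String> target = Set.of(DOC_ANNOTATION, "uima.tcas.Annotation", "example.Token");
        for (String type : filteredDocAnnotationSubtypes(parentOf, target)) {
            System.err.println("WARNING: subtype of DocumentAnnotation is filtered out: " + type);
        }
    }
}
```

Whether such a finding should be a warning or an error depends on the answer to the question above, namely how likely it is that the filtering was intended.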


On 8/5/2013 7:55 AM, Marshall Schor wrote:
> Thanks for expanding on this issue, see some comments below.
> On 8/5/2013 4:25 AM, Richard Eckart de Castilho wrote:
>>>> Richard Eckart de Castilho commented on UIMA-3141:
>>>> --------------------------------------------------
>>>> If the custom sub-type of DocumentAnnotation is part of the target type system,
>>>> it works (not verified in exactly the given test case, but in the context from which this
>>>> test was distilled).
>>>> Since the document annotation is a special annotation in UIMA, it may require
>>>> special handling. I would expect that all features are set if they are available on the document
>>>> annotation, even if the type of the document annotation is not the same.
>>> I'm not following... All features of DocMeta are set.  It's the entire type
>>> instance of DocMeta that's being "filtered out" when deserializing.
>>> I'm probably not understanding your point correctly though - please say more
>> I think my point is this: if the type T for a given FS is not available in the target
>> type system, but a type S which is the supertype of T in the source type system is available
>> in the target type system, then an instance of S should be created and all features should
>> be set.
>> Now, stating it like this, it becomes obvious that this is probably not the best
>> idea in general. Eventually nothing would be filtered, because everything inherits from TOP.
>> But I'll still go on and explain what led me to believe this would be a good idea, at least
>> in certain cases.
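
The fallback rule described above can be sketched in plain Java as follows. This is a toy model under stated assumptions, not UIMA code and not how form 6 deserialization works: the source hierarchy is a child-to-parent map, and the class name, method name, and the types `example.DocMeta` and `example.Token` are hypothetical. The built-in names (`uima.cas.TOP`, `uima.cas.AnnotationBase`, `uima.tcas.Annotation`, `uima.tcas.DocumentAnnotation`) are real.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Toy model (not UIMA API): resolve a source type to the nearest type
// (itself or an ancestor) that exists in the target type system.
final class SupertypeFallbackSketch {

    // Walks up the source hierarchy until a type known to the target is found.
    // Assumes the child -> parent map is acyclic.
    static Optional<String> nearestAvailableType(String type,
                                                 Map<String, String> parentOf,
                                                 Set<String> targetTypes) {
        String current = type;
        while (current != null) {
            if (targetTypes.contains(current)) {
                return Optional.of(current);
            }
            current = parentOf.get(current);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = Map.of(
                "example.DocMeta", "uima.tcas.DocumentAnnotation", // hypothetical
                "example.Token", "uima.tcas.Annotation",           // hypothetical
                "uima.tcas.DocumentAnnotation", "uima.tcas.Annotation",
                "uima.tcas.Annotation", "uima.cas.AnnotationBase",
                "uima.cas.AnnotationBase", "uima.cas.TOP");
        Set<String> target = Set.of("uima.cas.TOP", "uima.tcas.DocumentAnnotation");

        System.out.println(nearestAvailableType("example.DocMeta", parentOf, target));
        System.out.println(nearestAvailableType("example.Token", parentOf, target));
    }
}
```

In this model the hypothetical `example.Token` resolves all the way up to `uima.cas.TOP`, which is the problem noted above: with an unrestricted fallback, nothing is ever filtered.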
>> Case 1: custom document annotation
>> First off, this point is moot when the document annotation type is customized as
>> described in the UIMA documentation [1]. However, not everybody follows that documentation.
>> E.g. Ruta and DKPro Core instead customize the document annotation type by deriving from it.
> This use was a surprise to me, and I wonder about the utility of it, as compared
> to extending the DocumentAnnotation by adding more features to it.  I'm
> wondering why the original designers of UIMA didn't declare this type to be a
> type which could not be inherited from.
>> The document annotation is quite special. There are methods in the CAS interface
>> (e.g. getDocumentLanguage()) which internally access the document annotation, but this is
>> not obvious. It appears that the language is just a property of the CAS itself.
> I agree this design is a bit unusual, and I don't know the reason it was done
> this way, other than I know there was a desire to keep UIMA independent of the
> actual kind of unstructured information being processed, and the designers were
> aware that not all unstructured data was "text" (think of audio, video, etc). 
> So my guess of the motivation behind this is that "language" was not part of the
> CAS, but rather part of the DocumentAnnotation, which was specific to "text". 
> But for convenience, the set/get methods were added to the CAS interface.
>> When loading data from a binary CAS with a customized document annotation type into
>> a target CAS with another document annotation type (either custom or default), one would expect
>> that such general information as the document language should be preserved. It is basically
>> mandatory that the language feature exists in any kind of document annotation, since it is
>> blessed with its own dedicated getter/setter methods in the CAS interfaces.
> So, I suppose we could special-case this feature. But it's not clear in the
> general case how to design this.  The general case might include situations
> where users declared multiple subtypes of DocumentAnnotation, or even
> subtypes of subtypes (in a supertype chain), and set some of their "language"
> features to several different values.  Some subset of these might be "filtered",
> but others might still exist.
> I think this is a surprising thing for users to do; however, I was surprised
> that users made subtypes of Document Annotation.  And I wonder if the better
> solution is to deprecate making subtypes of Document Annotation, rather than
> trying to find a way to handle these kinds of cases.
>> Case 2: tags as types
>> Several type systems model tags/categories as types. A typical type hierarchy would
>> e.g. contain a type PartOfSpeech and sub-types Noun, Verb, etc. (often categories from a
>> specific tag set are used). The PartOfSpeech type tends to also have a feature holding the
>> tag value, e.g. "tag", which assumes values such as "NN", "NNP", etc. (generally from a specific
>> tag set, even if the sub-types mentioned before may be more coarse-grained).
>> Assume one is serializing a CAS containing such tag sub-types, e.g. in an annotation
>> editor. Now the user reconfigures the type system, e.g. switching from coarse-grained tag
>> types ("Noun") to fine-grained tag types ("NN", "NNP", etc.). Then the user loads the data
>> back. Currently, all the annotations of type "Noun" would be lost, because the "Noun" type
>> does not exist anymore. It would be useful if they had just been downgraded to "PartOfSpeech"
>> annotations, which now could be upgraded to the new "NN", "NNP" types.
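
As a rough illustration of the downgrade described above, the sketch below keeps only the features that the supertype declares when an instance is re-created as that supertype. It is a plain-Java toy model, not UIMA code: features are modelled as a string map, and the class name, method name, and the `nounClass` feature are hypothetical. The `tag` feature comes from the example above; `begin` and `end` are the standard annotation offsets.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model (not UIMA API): when "Noun" is no longer available and an instance
// is re-created as "PartOfSpeech", only features declared on PartOfSpeech survive.
final class TagDowngradeSketch {

    // Returns the subset of `features` whose names the supertype declares.
    static Map<String, String> downgradeFeatures(Map<String, String> features,
                                                 Set<String> superTypeFeatures) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : features.entrySet()) {
            if (superTypeFeatures.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A "Noun" instance; nounClass is a hypothetical feature declared only on Noun.
        Map<String, String> noun = Map.of(
                "begin", "0", "end", "3", "tag", "NN", "nounClass", "common");
        Set<String> partOfSpeechFeatures = Set.of("begin", "end", "tag");
        System.out.println(downgradeFeatures(noun, partOfSpeechFeatures));
    }
}
```

The surviving `tag` value is what would later allow such an annotation to be upgraded to one of the new fine-grained types.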
> I wonder if supporting this kind of up-classing is sufficiently useful and
> general to be part of the form 6 serialization / deserialization.  I can imagine
> many other kinds of type system "conversions" that users might want. 
> The general topic of type system conversion is a complex one.  I think more
> complex forms of type conversion are an orthogonal topic to compressed binary
> serialization.  More complex forms of this probably don't belong in form 6
> serialization/deserialization, which I think should be limited to the simpler
> type and feature filtering that other serialization / deserialization forms
> also do when "lenient" forms are used.  (CasCopier has lenient forms, too.)
>> As mentioned before, generally falling back to super-types is an obviously bad idea,
>> even though there may be use cases where this can help (case 2). However, I still think that
>> specially blessed information, such as document language should be preserved, even if the
>> document annotation type is changed (case 1).
> Is this a real, frequently occurring situation?  Why wouldn't one include the
> DocMeta type in the target type system?    I think that in the general case
> (where users could design an arbitrary tree of subtypes of DocumentAnnotation,
> and instantiate one or more of these types, and then filter one or more of these
> types), there is not an obvious design on how to "pick" the right language
> setting and how to promote it or if it needs promoting.   I think this whole
> area can easily go beyond the design intent of UIMA (which was to encourage
> interoperability and sharing in a growing community of people working in
> unstructured analysis), and that the better solution is to gradually enforce the
> simpler approach by deprecating type definitions that try to be a subtype of
> DocumentAnnotation, unless of course there are valid use-cases for doing this
> (which I'm unaware of at the moment :-) ).
> -Marshall
