uima-dev mailing list archives

From "Rinat Gareyev (JIRA)" <...@uima.apache.org>
Subject [jira] [Created] (UIMA-2455) Make ordering of getNextAnnotations result configurable
Date Tue, 14 Aug 2012 16:52:38 GMT
Rinat Gareyev created UIMA-2455:

             Summary: Make ordering of getNextAnnotations result configurable
                 Key: UIMA-2455
                 URL: https://issues.apache.org/jira/browse/UIMA-2455
             Project: UIMA
          Issue Type: New Feature
          Components: TextMarker
            Reporter: Rinat Gareyev

Example rule:

Example text:
aText bText cText cMoreText

where the following correspondence between annotations and tokens holds:
A = aText
B = bText
C = cText
C = cText cMoreText

The rule results in the following:
D = cText

However, I expect:
D = cText cMoreText

The reason for the actual behaviour is the org.apache.uima.textmarker.rule.AnnotationComparator#compare
implementation. It orders a shorter annotation before a longer one. That is why the sequence
'aText bText cText' is matched and the sequence 'aText bText cText cMoreText' is not: the latter
is considered later and then fails the NOT PARTOF condition.
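
To illustrate the ordering effect outside of TextMarker, here is a minimal Java sketch. The
Span class and both comparators are hypothetical stand-ins, not the actual AnnotationComparator
code. The sketch assumes that candidates are sorted by begin offset first and that only the
tie-break on the end offset differs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OrderingSketch {
    // Hypothetical stand-in for an annotation: begin/end offsets plus covered text.
    static final class Span {
        final int begin;
        final int end;
        final String text;

        Span(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    // Shorter-first: begin ascending, then end ascending (the behaviour described above).
    static final Comparator<Span> SHORTER_FIRST =
        Comparator.<Span>comparingInt(s -> s.begin).thenComparingInt(s -> s.end);

    // Longer-first: begin ascending, then end descending.
    static final Comparator<Span> LONGER_FIRST =
        Comparator.<Span>comparingInt(s -> s.begin)
            .thenComparing(Comparator.<Span>comparingInt(s -> s.end).reversed());

    // Returns the text of the C annotation that would be tried first under the given order.
    static String firstCandidate(Comparator<Span> order) {
        // In "aText bText cText cMoreText" both C annotations begin at offset 12.
        List<Span> cAnnotations = new ArrayList<>();
        cAnnotations.add(new Span(12, 27, "cText cMoreText"));
        cAnnotations.add(new Span(12, 17, "cText"));
        cAnnotations.sort(order);
        return cAnnotations.get(0).text;
    }

    public static void main(String[] args) {
        System.out.println(firstCandidate(SHORTER_FIRST)); // prints "cText"
        System.out.println(firstCandidate(LONGER_FIRST));  // prints "cText cMoreText"
    }
}
```

With the shorter-first order the 'cText' candidate is tried first, which corresponds to the
D = cText result reported above.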

I discovered this after migrating to the latest TextMarker sources (from the ASF repo). Before
that we used the version from Sourceforge.net. In the old (Sourceforge) version this problem did
not arise because TextMarkerBasic could keep only one annotation per Type as 'begin anchor'.
Returning to the example, this means that the 'cText' TextMarkerBasic held only one C annotation
as begin anchor.

In the current version (rev. 1371274), TextMarkerBasic keeps a set of begin and end anchors per
Type. This is actually a good improvement.
However, I suggest making the ordering of anchored annotations returned by the
TextMarkerRuleElement#getNextAnnotations(boolean, AnnotationFS, TextMarkerStream) method more
controllable, e.g., by adding a parameter for TextMarkerEngine or the script which will define AnnotationComparator#compare

Also, returning longer annotations before shorter ones seems more consistent with the UIMA
default indexing. See http://uima.apache.org/d/uimaj-2.4.0/references.html#ugr.ref.cas.index.built_in_indexes
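
One possible shape for such a parameter, as a rough Java sketch only: the AnchorOrdering enum,
its values and the sorted helper are hypothetical names invented for illustration and do not
exist in TextMarker. Annotations are modelled as plain {begin, end} offset pairs to keep the
example self-contained.

```java
import java.util.Arrays;
import java.util.Comparator;

class ConfigurableOrderingSketch {
    // Hypothetical configuration values; not an existing TextMarker parameter.
    enum AnchorOrdering {
        SHORTER_FIRST, LONGER_FIRST;

        // Compares {begin, end} pairs: begin ascending, then end in the chosen direction.
        Comparator<int[]> comparator() {
            Comparator<int[]> byBegin = Comparator.comparingInt(a -> a[0]);
            Comparator<int[]> byEnd = Comparator.comparingInt(a -> a[1]);
            return byBegin.thenComparing(this == LONGER_FIRST ? byEnd.reversed() : byEnd);
        }
    }

    // Sorts a copy of the offset pairs using the ordering named by a (hypothetical) parameter value.
    static int[][] sorted(int[][] spans, String parameterValue) {
        int[][] copy = spans.clone();
        Arrays.sort(copy, AnchorOrdering.valueOf(parameterValue).comparator());
        return copy;
    }

    public static void main(String[] args) {
        // The two C annotations from the example text, as {begin, end} offsets.
        int[][] cAnnotations = { {12, 17}, {12, 27} };
        System.out.println(Arrays.deepToString(sorted(cAnnotations, "LONGER_FIRST")));
        // prints [[12, 27], [12, 17]]
    }
}
```

The LONGER_FIRST value (begin ascending, end descending) mirrors how the UIMA built-in
annotation index orders annotations that share a begin offset.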
