uima-dev mailing list archives

From Peter Klügl (JIRA) <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-5757) Unable to extract features when annotation ends with HTML tag
Date Tue, 24 Apr 2018 11:57:00 GMT

    [ https://issues.apache.org/jira/browse/UIMA-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449701#comment-16449701 ]

Peter Klügl commented on UIMA-5757:

With the default filtering settings, annotations that start or end with MARKUP are not visible.
The Document/DocumentAnnotation annotation is an exception: it is always visible and can
therefore always be matched. All other annotations at the same offsets, however, follow
the common filtering rules. Thus, in order to match annotations of specific types that
cover the complete sofa string, you need to retain all filtered types, e.g., MARKUP and maybe
BREAK/ES in your example.
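
A minimal sketch of how the reported script could be adapted, assuming that retaining MARKUP
alone is enough for the example document (other filtered types may need to be retained for
other inputs). It uses the RETAINTYPE action to make MARKUP visible before matching and
resets the filter afterwards; it has not been run against 2.6.1.

{code:java}
// RUTA script (sketch, untested): retain MARKUP so that annotations
// ending with an HTML tag are visible to the matching rule
STRING documentId = "Unknown";
Document{-> RETAINTYPE(MARKUP)};
com.acme.uima.types.MyDocument{-> GETFEATURE("documentId", documentId)};
// RETAINTYPE without arguments resets the retained types to the default
Document{-> RETAINTYPE};
Document{-> LOG("Starting to process document: " + documentId)};
{code}

The actions are attached to Document here because that annotation is always visible,
regardless of the current filtering settings.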

> Unable to extract features when annotation ends with HTML tag
> -------------------------------------------------------------
>                 Key: UIMA-5757
>                 URL: https://issues.apache.org/jira/browse/UIMA-5757
>             Project: UIMA
>          Issue Type: Bug
>          Components: Ruta
>    Affects Versions: 2.6.1ruta
>         Environment: RUTA 2.6.1, Windows 10, Eclipse Mars, JDK 1.8.0_144
>            Reporter: Miguel Alvarez
>            Priority: Minor
> If there is an annotation that covers the whole sofa string, and the sofa string ends
> with an HTML tag, it seems like RUTA isn't able to extract the features for that annotation.
> For instance, let's suppose this document (represented as XMI):
> {code:java}
> // XMI document
> <?xml version="1.0" encoding="UTF-8"?>
> <xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmlns:tcas="http:///uima/tcas.ecore"
> xmlns:types="http:///com/acme/uima/types.ecore" xmi:version="2.0">
> <cas:NULL xmi:id="0"/>
> <tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="12" language="es"/>
> <types:MyDocument xmi:id="14" sofa="1" begin="0" end="12" documentId="test_docsize_39d5541c-5e7f-391c-95af-c82ce6306644"/>
> <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="ABCDEFGHIJ&lt;p&gt;"/>
> <cas:View sofa="1" members="8 14"/>
> </xmi:XMI>
> {code}
> And the following RUTA script:
> {code:java}
> // RUTA script
> STRING documentId = "Unknown";
> com.acme.uima.types.MyDocument{-> GETFEATURE("documentId", documentId)};
> LOG("Starting to process document: " + documentId);
> {code}
> The LOG action will output Unknown. But as soon as the string doesn't end with an HTML
> tag, it works fine.
> Any ideas what could be going on?
