uima-dev mailing list archives

From Peter Klügl (JIRA) <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-2524) TextMarker html conversion to plain text is not working correctly
Date Wed, 19 Dec 2012 09:43:13 GMT

    [ https://issues.apache.org/jira/browse/UIMA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535821#comment-13535821 ]

Peter Klügl commented on UIMA-2524:
-----------------------------------

It is probably better to separate these two functionalities: annotating html files with
annotations for the html tags, and converting html files to plain text while retaining the
annotations for the html tags. A new analysis engine for the second functionality could then
also be used on a CAS that already contains other annotations, which would make it a useful
analysis engine for converting any CAS with html artifacts. I will refactor the HTMLAnnotator
and remove the code for stripping the html tags, and I will create a new issue for the
additional analysis engine.
                
> TextMarker html conversion to plain text is not working correctly
> -----------------------------------------------------------------
>
>                 Key: UIMA-2524
>                 URL: https://issues.apache.org/jira/browse/UIMA-2524
>             Project: UIMA
>          Issue Type: Bug
>          Components: TextMarker
>    Affects Versions: 2.0.0TextMarker
>            Reporter: Peter Klügl
>            Assignee: Peter Klügl
>
> The HTMLAnnotator shipped with TextMarker is able to strip the html tags and to create
> an additional view with the plain text. During this step the tag information is converted
> to annotations, whose offsets are adapted according to the removed tags. This functionality
> is not working correctly: the tags of the body of the html document are not removed.
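
The offset adaptation described in the issue can be illustrated with a minimal sketch. This is not the TextMarker or HTMLAnnotator code and does not use the UIMA API; it is plain Python built on the standard library's html.parser, and the names TagAnnotation, TagStripper and strip_html are hypothetical. It assumes that each annotation should cover the text content of its element, measured in the offsets of the stripped plain text.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class TagAnnotation:
    """Hypothetical stand-in for an annotation created from an html tag."""
    name: str   # tag name, e.g. "b"
    begin: int  # plain-text offset where the element's content starts
    end: int    # plain-text offset where the element's content ends


class TagStripper(HTMLParser):
    """Collects the plain text and one annotation per closed element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # pieces of plain text seen so far
        self.length = 0        # current length of the plain text
        self.open_tags = []    # stack of (tag name, begin offset)
        self.annotations = []

    def handle_starttag(self, tag, attrs):
        # The element starts at the current end of the plain text,
        # because the tag itself contributes no characters.
        self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the nearest open tag with this name; stray end tags are ignored.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                name, begin = self.open_tags.pop(i)
                self.annotations.append(TagAnnotation(name, begin, self.length))
                break

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def strip_html(html):
    """Return (plain_text, annotations) for the given html string."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.annotations


if __name__ == "__main__":
    text, annotations = strip_html(
        "<html><body><p>Hello <b>world</b>!</p></body></html>")
    print(text)
    for annotation in annotations:
        print(annotation, repr(text[annotation.begin:annotation.end]))
```

In this sketch the tags inside the body are removed along with all others, and every annotation's begin and end index into the stripped text rather than the original html. Elements that are never closed (for example a bare br) produce no annotation here; a real analysis engine would need to decide how to handle them.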

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
