uima-dev mailing list archives

From Peter Klügl (JIRA) <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-4453) MARKTABLE action works improperly
Date Wed, 10 Jun 2015 20:13:01 GMT

    [ https://issues.apache.org/jira/browse/UIMA-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581011#comment-14581011 ]

Peter Klügl commented on UIMA-4453:

The problem is caused by the combination of the filtering settings in the rule script and the
entries in the table. The table lookup cannot see whitespace, since whitespace is filtered
by default. However, the table contains entries with spaces. This causes problems because
the table uses a trie structure to represent the column data, and there is no lookahead when
spaces in the entries are skipped automatically. As a result, matching fails for entries
whose characters also occur after whitespace in other entries.
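
To illustrate the failure mode described above, here is a minimal sketch in Python. It is a simplified model written for this explanation, not the UIMA Ruta implementation: the names `build_trie`, `greedy_lookup` and `lookahead_lookup` are hypothetical, and the assumption is that the matcher sees the document text without spaces and always follows a space edge in the trie when one exists.

```python
END = "$end"  # marker key for a complete entry; longer than one char, so it cannot collide


def build_trie(entries):
    """Store each table entry character by character, spaces included."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = entry
    return root


def greedy_lookup(trie, text):
    """Match whitespace-free text; a space edge is always taken (assumed behavior)."""
    node = trie
    for ch in text:
        if " " in node:
            # Skip the stored space without checking whether `ch`
            # could have continued along a non-space branch instead.
            node = node[" "]
        if ch not in node:
            return None
        node = node[ch]
    return node.get(END)


def lookahead_lookup(trie, text):
    """Hypothetical variant: try the direct branch first, then the space branch."""
    def walk(node, i):
        if i == len(text):
            return node.get(END)
        ch = text[i]
        if ch in node:
            found = walk(node[ch], i + 1)
            if found is not None:
                return found
        if " " in node:
            return walk(node[" "], i)
        return None

    return walk(trie, 0)


# Entries taken from the first column of dict.csv in the issue description.
ENTRIES = ["від", "с какой стати", "с которой", "сюда",
           "по какому", "сюди", "как нибудь", "сколько"]
trie = build_trie(ENTRIES)

for word in ["від", "скоторой", "сколько", "сюда"]:
    print(word, greedy_lookup(trie, word), lookahead_lookup(trie, word))
```

In this model, "скоторой" still resolves to "с которой", but "сколько" and "сюда" are lost by the greedy lookup because the space after "с" in other entries is skipped first. The variant with lookahead finds them, at the cost of backtracking.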

There are several ways to solve or avoid this problem.
- Remove the whitespace from the entries in dict.csv (the best option right now; I just tested
it, but it makes the table hard to read).
- Activate a special configuration parameter (it is currently missing in BasicEngine.xml and
in the generated descriptors, and probably still has some problems in your use case; this
should eventually be the clean alternative to the first option).
- Make the lookup process sensitive to whitespace (this is often not wanted and needs a
different configuration of the table call and rules).
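
The first workaround can be scripted. The following is a minimal sketch, assuming the table is semicolon-separated with the lookup key in the first column, as in the dict.csv of the issue description; the helper name `remove_key_whitespace` is hypothetical.

```python
def remove_key_whitespace(rows, sep=";"):
    """Strip all whitespace from the first column of each row, leaving the rest untouched."""
    cleaned = []
    for row in rows:
        key, found_sep, rest = row.partition(sep)
        # str.split() without arguments splits on any whitespace run,
        # so joining the parts removes every space in the key.
        cleaned.append("".join(key.split()) + found_sep + rest)
    return cleaned


rows = ["від;from", "с какой стати;why", "с которой;fromWhich", "сколько;howMuch"]
for line in remove_key_whitespace(rows):
    print(line)
```

Applied to a file, the same function can be run over the lines read from dict.csv (decoded as UTF-8) before writing them back. The meaning column is left as it is, so only the lookup keys lose their readability.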

The difference between UIMA Ruta 2.2.1 and 2.3.0 is caused by UIMA-4079, where a problem with
whitespace in tables was fixed.

> MARKTABLE action works improperly
> ---------------------------------
>                 Key: UIMA-4453
>                 URL: https://issues.apache.org/jira/browse/UIMA-4453
>             Project: UIMA
>          Issue Type: Bug
>          Components: ruta
>    Affects Versions: 2.3.0ruta
>         Environment: OS X 10.9.1, Java v8u45, Eclipse Luna
> Windows 7, Java v8u45, Eclipse Luna
>            Reporter: Oleg Fedoriaka
>            Assignee: Peter Klügl
>             Fix For: 2.1.0ruta
>   Original Estimate: 96h
>  Remaining Estimate: 96h
> The newly available UIMA Ruta Runtime 2.7.0 & Workbench 2.3.0 for Eclipse has lost the proper
> functionality of the MARKTABLE action. The action stopped annotating all words from a CSV
> file. I noticed that the problem happens only for words written in Cyrillic which contain
> spaces, i.e. for Latin it works fine. Please use the sample outlined below to reproduce
> the problem I'm talking about.
> # script/main.ruta
> WORDTABLE Dict = 'dict.csv';
> DECLARE Annotation Test (STRING meaning);
> Document {-> MARKTABLE(Test,1,Dict, "meaning" = 2)};
> # resources/dict.csv
> від;from
> с какой стати;why
> с которой;fromWhich
> сюда;here
> по какому;which
> сюди;here
> как нибудь;somehow
> сколько;howMuch
> # input/test.txt
> від с какой стати с которой сюда по какому сюди
> как нибудь сколько
> After executing the main.ruta script, not everything from test.txt is annotated. It is worth
> mentioning that a Cyrillic letter like 'с' at the beginning of a string somehow affects the
> processing behavior. Moreover, removing the lines with spaces gets rid of the issue
> described above.

This message was sent by Atlassian JIRA