uima-dev mailing list archives

From Peter Klügl (JIRA) <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-4079) MarkTable action not able to recognize entities with two or more words
Date Sat, 01 Nov 2014 14:42:33 GMT

    [ https://issues.apache.org/jira/browse/UIMA-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193209#comment-14193209
] 

Peter Klügl commented on UIMA-4079:
-----------------------------------

Unfortunately, that's not that easy.

I assume that the problem can be observed when entries of two tokens are not assigned to feature
values. I am going to explain the problem for word lists and dictionary lookup in general. It's
the same thing for word tables.

Ruta provides a coverage-based concept of visibility for rules. Text covered by an annotation
of a type that is filtered is not visible to rules. One strength of the dictionary lookup
in Ruta is that it is also able to use this functionality. You can configure text spans that
should be ignored by the dictionary lookup with FILTERTYPE and friends. This means that the
dictionary lookup never sees a whitespace when the default filtering settings are used. The
actual string provided to the dictionary is not "Bill Clinton" but "BillClinton". Therefore,
it does not matter if there is one space or several spaces (or any kind of invisible text)
between "Bill" and "Clinton". If we only used "getCoveredText()", then the lookup would
fail in many scenarios.
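
To make the difference between covered text and visible covered text concrete, here is a minimal
sketch in plain Java. It is not the Ruta implementation: the class name, the method name and the
representation of filtered annotations as [begin, end) offset pairs are assumptions made only for
illustration.

```java
public class VisibleTextSketch {

    // Returns the characters of text[begin, end) that are NOT covered by any hidden span.
    // Each hidden span is a hypothetical stand-in for a filtered annotation: {begin, end}.
    static String visibleCoveredText(String text, int begin, int end, int[][] hidden) {
        StringBuilder sb = new StringBuilder();
        for (int i = begin; i < end; i++) {
            boolean covered = false;
            for (int[] span : hidden) {
                if (i >= span[0] && i < span[1]) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                sb.append(text.charAt(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String doc = "Bill   Clinton";      // three spaces between the tokens
        int[][] hidden = { { 4, 7 } };      // the whitespace span is filtered
        System.out.println(doc);            // plain covered text, spaces included
        System.out.println(visibleCoveredText(doc, 0, doc.length(), hidden));
    }
}
```

With the whitespace span filtered, the string handed to the lookup is the same no matter how many
spaces separate the two tokens.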

Dictionary entries like "Bill Clinton" are only found using the default filtering settings
due to a convenience method that skips whitespaces in the trie (dictionary char nodes). This
actually also causes the problem that sometimes entries are not found in the documents if
the dictionary contains entries that provide ambiguous paths in the trie.
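
The following sketch shows one way such a whitespace skip could behave, and how ambiguous paths
can hide an entry. It is only an illustration under assumptions: the class, the greedy
no-backtracking strategy and all names are hypothetical and not taken from the Ruta code.

```java
import java.util.HashMap;
import java.util.Map;

public class SkippingTrieSketch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEntry;
    }

    final Node root = new Node();

    // Stores the entry exactly as written, including its whitespace.
    void add(String entry) {
        Node n = root;
        for (char c : entry.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.isEntry = true;
    }

    // Greedy lookup of whitespace-free visible text: follow the matching child if
    // there is one, otherwise step over a single space node in the trie and retry.
    // There is no backtracking, so an early wrong turn cannot be undone.
    boolean containsIgnoringTrieSpaces(String visibleText) {
        Node n = root;
        for (char c : visibleText.toCharArray()) {
            Node next = n.children.get(c);
            if (next == null) {
                Node space = n.children.get(' ');
                next = (space == null) ? null : space.children.get(c);
            }
            if (next == null) {
                return false;
            }
            n = next;
        }
        return n.isEntry;
    }
}
```

In this sketch a dictionary holding only "Bill Clinton" matches the visible text "BillClinton".
With the two entries "a bc" and "ab d", however, the visible text "abc" is not found: after "a"
the lookup prefers the direct child "b" that belongs to "ab d" and never tries the path through
the space.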

I do not really want to change this strategy because it allows the user to specify whitespace-sensitive
dictionaries, which contain entries with different combinations of whitespaces.

After all, the increased expressiveness comes at the price that users have problems applying
the dictionaries. We should do something about that. I normally suggest removing all unimportant
chars in the dictionary entries, but that is not really a convenient approach for users.

There are several things that we can do in order to improve it:
- I could introduce a parameter (in the engine) that, when activated, removes all whitespaces
when the dictionaries are loaded. (However, we would need to consider multi tree word lists.)
This would lead to whitespace-insensitive dictionaries for all applied script files in the
engine.
- I could introduce a fall-back method that uses "getCoveredText" if "getVisibleCoveredText"
has not found any entries, or a method that checks their existence ignoring spaces. This would
suffice in most scenarios, but is not able to provide the complete functionality because you
never know or will never be able to reproduce the current visibility setting within the dictionaries.
The annotations are simply not present.
- I could refactor the complete lookup process in order to remember the row of the table in
which the entry was matched. Then, the problematic code mentioned in the question would not
be necessary. However, this refactoring should not be done before the refactoring of the complete
dictionary stuff.
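
As a rough illustration of the first two options, here is a sketch in plain Java. It uses a
simple set of strings instead of the real trie, and every name in it is hypothetical: neither
the parameter nor the fall-back method exists in Ruta at this point.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionaryOptionsSketch {

    // Option 1 (assumed behaviour): strip all whitespace from the entries once,
    // when the dictionary is loaded, so only whitespace-free keys are stored.
    static Set<String> loadWhitespaceInsensitive(List<String> entries) {
        Set<String> result = new HashSet<>();
        for (String entry : entries) {
            result.add(entry.replaceAll("\\s+", ""));
        }
        return result;
    }

    // Option 2 (assumed behaviour): try the visible text first and fall back to
    // the raw covered text only if the visible text was not found.
    static boolean lookupWithFallback(Set<String> dictionary, String visibleText,
            String coveredText) {
        return dictionary.contains(visibleText) || dictionary.contains(coveredText);
    }
}
```

The fall-back finds "Bill Clinton" when the document contains exactly one space, but it still
misses the entry when the document contains two spaces, which matches the limitation described
for the second option.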

Any opinions?


> MarkTable action not able to recognize entities with two or more words
> ----------------------------------------------------------------------
>
>                 Key: UIMA-4079
>                 URL: https://issues.apache.org/jira/browse/UIMA-4079
>             Project: UIMA
>          Issue Type: Bug
>          Components: ruta
>    Affects Versions: 2.2.2ruta
>            Reporter: Silvestre Losada
>             Fix For: 2.2.2ruta
>
>
> I think this error was introduced solving UIMA-4071. The problem is that  RutaStream.getVisibleCoveredText
method removes whitespaces in covered text. For example Bill Clinton covered text returns
BillClinton.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
