uima-dev mailing list archives

From "Jasper Huzen (JIRA)" <...@uima.apache.org>
Subject [jira] [Updated] (UIMA-5775) Performance problem MARKTABLE when matching case insensitive
Date Mon, 14 May 2018 12:56:00 GMT

     [ https://issues.apache.org/jira/browse/UIMA-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jasper Huzen updated UIMA-5775:
    Attachment: UIMA-5775.patch

> Performance problem MARKTABLE when matching case insensitive
> ------------------------------------------------------------
>                 Key: UIMA-5775
>                 URL: https://issues.apache.org/jira/browse/UIMA-5775
>             Project: UIMA
>          Issue Type: Bug
>          Components: Ruta
>    Affects Versions: 2.6.1ruta
>            Reporter: Jasper Huzen
>            Priority: Major
>         Attachments: UIMA-5775.patch
> Hi,
> We encounter a performance issue (or possibly an infinite loop) when we use the MARKTABLE action with case-insensitive value lists.
> The call in our script is:
> {code:java}
> MARKTABLE(LawName, 1, 'nl_law_names.ignorecase.csv', true, 0, "", 0, "lawIdentifier" = 2);{code}
> Using the following input fragment will result in a timeout exception after 1 minute.
> {code:java}
> Groenboek COM(2006) 105 definitief een Europese strategie voor duurzame, concurrerende en continu geleverde energie voor Europa {SEC(2006)317}{code}
> That complete name is a Dutch law name and is also an entry in _nl_law_names.csv_.
> When we try to match it with the ignoreCase flag set to false, matching is fast and there is no problem. If we set the flag to true (case is ignored), the matching is really slow or even hangs in an infinite loop.
> I debugged the code, which pointed me to the _TreeWordList_ class. The recursive method _recursiveContains_ has a potential bug.
> I think the problem occurs when the item contains a special character that is identical in upper and lower case. The recursive method will then look/fork twice on the same tree item.
> I made a fix that checks whether the uppercase character is the same as the lowercase character, and in that case makes the recursive call only once. That solved the (performance) issue, but I'm not sure whether this is really the main problem or whether the current fix is the best one.
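
The suspected double fork can be illustrated with a small stand-alone sketch. This is not the Ruta source: the class CaseForkSketch, its Node type, the dedupe flag and the calls counter are all hypothetical names invented for this illustration, and the real _recursiveContains_ in _TreeWordList_ may differ in structure. The sketch only assumes that a case-insensitive lookup tries the lowercase and the uppercase child in turn, and shows how a run of caseless characters such as '(' then revisits the same subtree twice per character when the match eventually fails.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a trie-based word list; NOT the actual Ruta TreeWordList.
class CaseForkSketch {

    // Minimal trie node (assumed structure, for illustration only).
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd;
    }

    final Node root = new Node();
    long calls; // counts recursive invocations of the last lookup

    void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.wordEnd = true;
    }

    // Case-insensitive lookup. With dedupe == false the upper-case branch is
    // always tried after a failed lower-case branch, even when both variants
    // are the same character. With dedupe == true that second try is skipped.
    boolean contains(String text, boolean dedupe) {
        calls = 0;
        return contains(root, text, 0, dedupe);
    }

    private boolean contains(Node node, String text, int index, boolean dedupe) {
        calls++;
        if (index == text.length()) {
            return node.wordEnd;
        }
        char c = text.charAt(index);
        char lower = Character.toLowerCase(c);
        char upper = Character.toUpperCase(c);

        Node viaLower = node.children.get(lower);
        if (viaLower != null && contains(viaLower, text, index + 1, dedupe)) {
            return true;
        }
        // The guard described in the report: identical variants need only one call.
        if (dedupe && upper == lower) {
            return false;
        }
        Node viaUpper = node.children.get(upper);
        return viaUpper != null && contains(viaUpper, text, index + 1, dedupe);
    }

    public static void main(String[] args) {
        String parens = "((((((((((((((((((((";  // 20 caseless characters
        CaseForkSketch list = new CaseForkSketch();
        list.add(parens + "a");

        String miss = parens + "b";              // shares the prefix, then fails
        list.contains(miss, false);
        System.out.println("calls without guard: " + list.calls);
        list.contains(miss, true);
        System.out.println("calls with guard:    " + list.calls);
    }
}
```

In this toy model the number of calls for a failing lookup doubles with every caseless character when the guard is off and grows linearly when it is on, which would be consistent with the slowdown reported for inputs containing parentheses, digits and braces. Whether the same mechanism is the only cause in the real class is not established here.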

This message was sent by Atlassian JIRA
