uima-dev mailing list archives

From "Jasper Huzen (JIRA)" <...@uima.apache.org>
Subject [jira] [Updated] (UIMA-5752) Problem with matching items in MarkTable with whitespacers visible
Date Mon, 19 Mar 2018 20:17:00 GMT

     [ https://issues.apache.org/jira/browse/UIMA-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jasper Huzen updated UIMA-5752:
-------------------------------
    Description: 
The change / fix in UIMA-4556 causes some problems when using a CSV file that contains whitespace.

Given a dictionary whose entries contain whitespace between words, the behavior is as follows.

>> Param PARAM_DICT_REMOVE_WS is TRUE:

When WS is visible in the token stream:
 - entries containing whitespace are not recognized (as expected).

When WS is NOT visible in the token stream:
 - all items in the dictionary are recognized
 - items are also recognized regardless of where whitespace appears between words. For example,
IlikeRUTA, Ilike Ruta and I like Ruta all result in the same match.
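
To illustrate why these variants collapse into one match, here is a minimal sketch, assuming that ignoring whitespace behaves like stripping it before comparison. The class WhitespaceCollapseDemo and the method stripWhitespace are hypothetical and are not part of Ruta. Case handling is left out, so the sketch uses consistently cased variants.

```java
// Hypothetical helper for illustration only; not part of Ruta.
final class WhitespaceCollapseDemo {

    // Removes every run of whitespace, so only the non-whitespace
    // characters take part in a later comparison.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String[] variants = {"IlikeRuta", "Ilike Ruta", "I like Ruta"};
        for (String variant : variants) {
            // Every variant is reduced to the same comparison key.
            System.out.println("\"" + variant + "\" -> " + stripWhitespace(variant));
        }
    }
}
```

Once whitespace is stripped, the matcher can no longer tell the three spellings apart, which is consistent with the behavior described above.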

>> Param PARAM_DICT_REMOVE_WS is FALSE:

When WS are visible in the token stream:
 - not all entries in the dictionary will be recognized

When WS are NOT visible in the token stream:
 - also not all entries in the dictionary will be recognized

The cause of this problem is that the default value for ignoring whitespace is always true
(hardcoded).
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}
This is not correct, because matching cannot take whitespace into account when it is
significant. The matcher should use the value of the PARAM_DICT_REMOVE_WS parameter
or the value that is set via the setIgnoreWS method.
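
A minimal sketch of the intended behavior, assuming a simplified matcher: the class DictionaryMatcherSketch, its constructor argument and its matches method are hypothetical and are not the actual Ruta classes. It uses a plain boolean instead of IBooleanExpression. Only the setIgnoreWS name is taken from the description above.

```java
// Hypothetical sketch for illustration only; not the actual Ruta matcher.
final class DictionaryMatcherSketch {

    // Seeded from the configuration value instead of a hardcoded `true`.
    private boolean ignoreWS;

    // dictRemoveWS stands in for the value of PARAM_DICT_REMOVE_WS (assumption).
    DictionaryMatcherSketch(boolean dictRemoveWS) {
        this.ignoreWS = dictRemoveWS;
    }

    // An explicit call overrides the configured default.
    void setIgnoreWS(boolean ignoreWS) {
        this.ignoreWS = ignoreWS;
    }

    boolean matches(String dictionaryEntry, String text) {
        if (ignoreWS) {
            // Whitespace is not significant: compare with it removed.
            return strip(dictionaryEntry).equals(strip(text));
        }
        // Whitespace is significant: compare the strings as they are.
        return dictionaryEntry.equals(text);
    }

    private static String strip(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        DictionaryMatcherSketch matcher = new DictionaryMatcherSketch(false);
        System.out.println(matcher.matches("I like Ruta", "IlikeRuta"));
        matcher.setIgnoreWS(true);
        System.out.println(matcher.matches("I like Ruta", "IlikeRuta"));
    }
}
```

With the flag taken from the configuration, an entry such as "I like Ruta" only matches "IlikeRuta" when whitespace is explicitly ignored.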

-I attached a patch to fix this issue.-

I'm working on a patch.

  was:
The change / fix in UIMA-4556 causes some problems when using a CSV file that contains whitespace.

Given a dictionary whose entries contain whitespace between words, the behavior is as follows.

>> Param PARAM_DICT_REMOVE_WS is TRUE:

When WS is visible in the token stream:
 - entries containing whitespace are not recognized (as expected).

When WS is NOT visible in the token stream:
 - all items in the dictionary are recognized
 - items are also recognized regardless of where whitespace appears between words. For example,
IlikeRUTA, Ilike Ruta and I like Ruta all result in the same match.

>> Param PARAM_DICT_REMOVE_WS is FALSE:

When WS are visible in the token stream:
 - not all entries in the dictionary will be recognized

When WS are NOT visible in the token stream:
 - also not all entries in the dictionary will be recognized



The cause of this problem is that the default value for ignoring whitespace is always true
(hardcoded).
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}
This is not correct, because matching cannot take whitespace into account when it is
significant. The matcher should use the value of the PARAM_DICT_REMOVE_WS parameter
or the value that is set via the setIgnoreWS method.

I attached a patch to fix this issue.


> Problem with matching items in MarkTable with whitespacers visible
> ------------------------------------------------------------------
>
>                 Key: UIMA-5752
>                 URL: https://issues.apache.org/jira/browse/UIMA-5752
>             Project: UIMA
>          Issue Type: Bug
>          Components: Ruta
>    Affects Versions: 2.6.1ruta
>            Reporter: Jasper Huzen
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
