uima-dev mailing list archives

From "Jeffrey Sorensen (JIRA)" <uima-...@incubator.apache.org>
Subject [jira] Commented: (UIMA-1041) UIMACPP Pythonator issues with annotation offsets and lengths - off by 1 errors
Date Mon, 02 Jun 2008 14:53:45 GMT

    [ https://issues.apache.org/jira/browse/UIMA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601642#action_12601642
] 

Jeffrey Sorensen commented on UIMA-1041:
----------------------------------------


The problem is caused by the ConvertUnicodeStringRef function in the Pythonator source,
due to the behavior of PyUnicode_DecodeUTF16 as documented here:
http://docs.python.org/api/builtinCodecs.html

According to that documentation, byte order marks are not copied into the target
string. In the Python source code, the following comment appears in the
implementation of PyUnicode_DecodeUTF16:

    /* Check for BOM marks (U+FEFF) in the input and adjust current
       byte order setting accordingly. In native mode, the leading BOM
       mark is skipped, in all other modes, it is copied to the output
       stream as-is (giving a ZWNBSP character). */

This suggests that passing an explicit byte order (a byteorder value of -1 or 1, rather
than 0 or a null pointer, which select native mode) will cause byte-order marks to be
preserved.
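
To make the effect on offsets concrete, here is a minimal sketch of the behavior that comment describes. It is not Python's implementation: decodeLikePython is a hypothetical stand-in that works on already-decoded 16-bit code units and only imitates the BOM handling, assuming byteorder 0 means native mode and -1 or 1 means an explicit order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a UTF-16 decoder's BOM handling. It only
// imitates the rule quoted above; it is not Python's actual code.
//   byteorder == 0        : native mode, a leading U+FEFF is skipped
//   byteorder == -1 or 1  : explicit order, a leading U+FEFF is kept
std::u16string decodeLikePython(const std::u16string &in, int byteorder) {
  assert(byteorder >= -1 && byteorder <= 1);
  const char16_t kBom = static_cast<char16_t>(0xFEFF);
  std::size_t start = 0;
  if (byteorder == 0 && !in.empty() && in[0] == kBom) {
    start = 1;  // BOM dropped: every later character moves down by one
  }
  return in.substr(start);
}
```

If the document text starts with U+FEFF, the string produced in native mode is one code unit shorter than the original, so an annotation computed on it would report begin and end offsets that are one too low. With an explicit byte order the length, and therefore the offsets, stay the same.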

Hence, my proposed replacement code is as follows:

static bool ConvertUnicodeStringRef(const UnicodeStringRef &ref,
        PyObject **rv) {
  if (sizeof(Py_UNICODE) == sizeof(UChar)) {
    *rv = PyUnicode_FromUnicode((const Py_UNICODE*) ref.getBuffer(),
        ref.length());
  } else {
    // test for big-endian, preset python decoder for native order
    // this will prevent PyUnicode_DecodeUTF16 from deleting byte order marks
    union { long l; char c[sizeof(long)]; } u;
    u.l = 1;
    int byteorder = (u.c[sizeof(long) - 1] == 1) ? 1 : -1;
    PyObject *r = PyUnicode_DecodeUTF16(
       (const char *) ref.getBuffer(), ref.getSizeInBytes(), 0, &byteorder);
    if (r==0) return false;
    *rv = r;
  }
  return true;
}

where the test for endianness was lifted from this page:
http://unixpapa.com/incnote/byteorder.html
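
For anyone who wants to check the endianness test in isolation, a small standalone variant might look like the sketch below. The name nativeByteOrder is hypothetical and not part of the Pythonator source; it uses memcpy rather than a union, which sidesteps the question of reading an inactive union member in C++, and it returns values with the same sign convention as the byteorder variable above (1 for big-endian, -1 for little-endian).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns 1 on a big-endian host and -1 on a little-endian host.
// Hypothetical helper for illustration only.
int nativeByteOrder() {
  const std::uint16_t probe = 0x0102;
  unsigned char bytes[sizeof probe];
  std::memcpy(bytes, &probe, sizeof probe);  // look at the in-memory layout
  // The first byte is the high byte (0x01) on big-endian hosts and the
  // low byte (0x02) on little-endian hosts.
  assert(bytes[0] == 0x01 || bytes[0] == 0x02);
  return bytes[0] == 0x01 ? 1 : -1;
}
```

On a typical x86 Linux machine this should return -1, but the value depends on the host.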

Jeff


> UIMACPP Pythonator issues with annotation offsets and lengths - off by 1 errors
> -------------------------------------------------------------------------------
>
>                 Key: UIMA-1041
>                 URL: https://issues.apache.org/jira/browse/UIMA-1041
>             Project: UIMA
>          Issue Type: Bug
>          Components: C++ Framework
>         Environment: RedHat, UIMACPP 2.2.2 release candidate 01, uima base 2.2.2
>            Reporter: Marshall Schor
>
> The sample python script when run in the document analyzer shows annotations where the
> highlight is always missing the last character, and the details show the offsets for the begin
> and end to be both one too low.
> To reproduce, run the sample script in the python directory of the scriptators (after
> doing a build/install of the pythonator following the directions in the python directory
> in python.html).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

