uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject possible core UIMA bug or limitation - plain binary delta serialization
Date Wed, 06 Jan 2016 21:49:17 GMT
While converting the plain binary serialization code as part of the UIMA v3
work, I ran across what looks like a problem.

The plain (not compressed) binary serialization form with delta cas support is
sometimes used for communicating between distributed UIMA-AS services and
clients (it does require all the type systems be identical, though).  The client
sends a full CAS to the service, which then keeps track of changes made
subsequently.  When the time comes to return the CAS to the client, "delta" CAS
serialization sends back just the new things created in the CAS plus any changes
to existing things.

The binary serialization code appears to have a bug or limitation for sending
changes made to existing entries in short or long arrays; this limitation
doesn't exist for boolean/byte arrays.

In pseudocode, what is sent for these changes is:
 (1) an int: the number of changes that follow
 (2) for each change: an int giving the address in the aux heap of the changed item
 (3) for each change: a byte/short/long holding the new value
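
As a rough illustration of the layout listed above, here is a minimal Java
sketch for the short-array case. The class and method names are hypothetical,
and this is not the actual UIMA serializer code; it only assumes the data is
written through a java.io.DataOutputStream in the order given.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not the actual UIMA serialization code.
final class DeltaChangeLayoutSketch {

    /** Encodes changes to existing short-array slots: count, then addresses, then values. */
    static byte[] encodeShortChanges(int[] addrs, short[] values) {
        if (addrs.length != values.length) {
            throw new IllegalArgumentException("addrs and values must have the same length");
        }
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bytes);
            dos.writeInt(addrs.length);      // (1) number of changes
            for (int addr : addrs) {
                dos.writeInt(addr);          // (2) aux heap address, full 32 bits
            }
            for (short v : values) {
                dos.writeShort(v);           // (3) new value for that slot
            }
            dos.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            // Cannot happen with an in-memory stream; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
    }
}
```

With two changes this produces 4 bytes for the count, 8 bytes for the two
addresses and 4 bytes for the two values.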

The bug is in item (2): for the short and long arrays, the address is sent as a
"short" instead of as an int; for the byte heap (also used for boolean arrays),
it is sent as an "int", which I think is correct.

This means that serialization will give wrong results if there's a change to
an item in the short or long aux heap whose index is beyond 32767, the largest
value a signed 16-bit short can hold.
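
The truncation can be demonstrated with a small, self-contained Java sketch.
It assumes the address is written with DataOutputStream.writeShort and read
back with DataInputStream.readShort; the class and method names are
hypothetical and the real UIMA reader may differ in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not the actual UIMA serialization code.
final class ShortAddressTruncation {

    /** Writes an int address as a 16-bit short and reads it back. */
    static int roundTripAsShort(int addr) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(addr);            // only the low 16 bits are written
            out.flush();
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return in.readShort();           // sign-extended back to an int
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripAsShort(32767));  // 32767: still correct
        System.out.println(roundTripAsShort(32768));  // -32768: wrong
        System.out.println(roundTripAsShort(40000));  // -25536: wrong
    }
}
```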

I think this should be fixed; but it will "break" compatibility with any stored
existing serialized form, and furthermore, for UIMA-AS transport use, both the
client and the server will need coordinated updates.

At a minimum, serialization should probably check whether the error would
occur (that is, an attempt to serialize a change at a slot > 32767) and throw
an exception.
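
A minimal sketch of such a check, in Java, might look like the following. The
class name, method name and exception type are assumptions chosen for
illustration; the actual code would presumably use whatever exception type
UIMA serialization already uses.

```java
// Hypothetical illustration only; not the actual UIMA serialization code.
final class ShortAddressGuard {

    /** Throws if the aux heap address cannot be represented in a signed 16-bit short. */
    static void checkFitsInShort(int addr) {
        if (addr < 0 || addr > Short.MAX_VALUE) {
            throw new IllegalStateException(
                "Cannot serialize change at aux heap slot " + addr
                + ": address does not fit in a short (max " + Short.MAX_VALUE + ")");
        }
    }
}
```

Calling this before each short-sized address write would turn silent
corruption into an immediate failure, without changing the serialized format.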

If we change this to write an "int" (instead of a short), we could also add a
global configuration flag to restore the old short format, if needed for some
backward compatibility purpose.
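
One way such a flag could work is sketched below in Java. The system property
name is invented for this example and is not an existing UIMA setting; the
class and method names are likewise hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not the actual UIMA serialization code.
final class AddressWidthFlag {

    // Invented property name, used here only as an example of a global flag.
    static final String LEGACY_PROP = "example.delta.legacyShortAddresses";

    /** Reads the global flag; false unless the JVM was started with -D<name>=true. */
    static boolean legacyEnabled() {
        return Boolean.getBoolean(LEGACY_PROP);
    }

    /** Writes one aux heap address in either the old (short) or new (int) width. */
    static void writeAddress(DataOutputStream dos, int addr, boolean legacy) throws IOException {
        if (legacy) {
            if (addr < 0 || addr > Short.MAX_VALUE) {
                throw new IllegalStateException("address " + addr + " does not fit in a short");
            }
            dos.writeShort(addr);   // old format: 2 bytes
        } else {
            dos.writeInt(addr);     // new format: 4 bytes
        }
    }

    /** Returns how many bytes one address occupies under the chosen format. */
    static int encodedSize(int addr, boolean legacy) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bytes);
            writeAddress(dos, addr, legacy);
            dos.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reader would need to consult the same flag, which is why client and
service would have to agree on its value.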

I would welcome opinions on how best to approach this...

