uima-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marshall Schor <...@schor.com>
Subject Re: possible core UIMA bug or limitation - plain binary delta serialization
Date Wed, 06 Jan 2016 22:57:51 GMT


On 1/6/2016 5:17 PM, Richard Eckart de Castilho wrote:
> On 06.01.2016, at 22:49, Marshall Schor <msa@schor.com> wrote:
>> In pseudocode - what's done for these changes is to send
>> (1) an int : the number of changes following
>> (2) for each change: an int representing the address into the aux heap of the item
>> (3) for each change: a byte/short/int represent the value
>>
>> The bug is in line 2: for the short and long arrays, this is sent as a "short"
>> instead of as an int; for the byte (also used for boolean arrays), this is sent
>> as an "int" which I think is correct.
>>
>> This means that serialization will give wrong results if there's a change to
>> some item in the short or long aux heaps which is indexed beyond 32767 items.
> Which form are we talking about? Form 4?
no - it's only for form 0 (just kidding - it's for **non-compressed** binary
serialization.
> Will it break all serialized data, only form 4 data (i.e. not form 6), or
> only form 4 with delta tracking enabled (assuming this is something that 
> must be specifically enabled - never used delta CAS).
None of these, only plain non-compressed binary serialization used in "delta
cas" mode.
>
>> I think this should be fixed; but it will "break" compatibility with any stored
>> existing serialized form, and furthermore, for UIMA-AS transport use, both the
>> client and the server will need coordinated updates.
> If it is a bug, +1 for fixing.
>
>> As a minimum, this should probably check to see if the error would be occurring
>> (trying to serialize some change at slot > 32767, and throw an exception.
>>
>> If we change this to use write an "int" (instead of short), we could add a
>> global configuration flag to disable this, too, if needed for some backward
>> compatibility purpose.
> I am starting to get suspicious of global flags for backwards compatibility.
> E.g. since ALLOW_DUP_ADD_TO_INDEXES was introduced, we have people complaining
> about a performance drop. ALLOW_DUP_ADD_TO_INDEXES can only be enabled/disabled
> globally, but not specifically for individual indexes. Neither can it be 
> temporarily disabled, e.g. during deserialization or other bulk operations.
> I wonder if local getters/setters or ThreadLocal variables initialized by
> a global setting wouldn't be a more appropriate option.

I was unaware of the performance issue; I may have missed some emails...  Can
you say how significant it is?  If there were no performance issue, would the
additional function be needed?

I assume the performance drop is when duplicates are not allowed (the new
default), and some users are wanting to restore the previous performance by
turning on ALLOW_DUP ....  Is this correct?

Cheers. -Marshall


> Best,
>
> -- Richard


Mime
View raw message