uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: opinion on degree of backwards compatibility for Uima V3 experiment
Date Fri, 02 Sep 2016 13:26:17 GMT
Re: deserializing the same CAS twice shouldn't change the addresses;  if you
have a case where it's doing that, I'll investigate (need a small test case...).


On 9/2/2016 5:36 AM, Peter Klügl wrote:
> Same here.
> It looks like we are now also starting to use the address, and I am
> also thinking of using it more in Ruta (internal indexing).
> Btw, I did some simple experiments lately concerning the stability of
> the addresses when using CasIOUtils. Can it happen that the addresses
> change if you just deserialize the same CAS twice without serializing it
> in between?
> Best,
> Peter
> On 01.09.2016 at 19:29, Richard Eckart de Castilho wrote:
>> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e. out-of-type-system)
>> unique identifiers for feature structures facilitates handling them, e.g. in editors. We
>> use that quite a bit in WebAnno.
>> In WebAnno, we do not rely on any heap arithmetic - an ID is just expected to be
>> a unique identifier. However, I could imagine cases where people might rely on the ID to increment
>> monotonically for new FSes.
>> Most binary formats do not preserve the ID across a save/load cycle. However, SERIALIZED
>> and SERIALIZED_TSI *do* preserve the ID, and WebAnno makes use of that. This allows keeping
>> references to FSes without having to keep the CAS in memory all the time.
>> There should continue to be a V3 serialization format which preserves IDs across
>> a load/save cycle.
>> I do not presently see a case where a strong similarity between V2 and V3 IDs would
>> be important. It would be nice if deserializing a V2 SERIALIZED or SERIALIZED_TSI into V3
>> would restore the V2 IDs - I expect it to be an easy thing to do.
>> Cheers,
>> -- Richard
>>> On 01.09.2016, at 16:09, Marshall Schor <msa@schor.com> wrote:
>>> The UIMA V3 implementation includes extra code in many places (costing time / space)
>>> whose goal is to make things look closer to version 2.  Some of this is for
>>> interoperability with version 2 artifacts, like serialized forms.
>>> An example: in v2, many serialization forms include "references" to other
>>> Feature Structures (FSs), and for those, the encoding is the "address" in the
>>> heap of the FS.
>>> In v3, there is no heap, but the FSs have "ids", which are (at the moment) an
>>> int which increments by 1.  This does not match the "address" in v2, so many parts
>>> of the serialization code build a map at serialization time from the v3 ids to
>>> v2 "addresses", and use the latter in the serialized form.
>>> Currently, this is done for various binary serializations, so that these can
>>> be read back in by v2 code.
>>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't checked).
>>> The serialized forms for these differ between v2 and v3, in that the numbers
>>> used to represent references to other FSs are different.
>>> The deserialization code for XMI and JSON doesn't depend on these numbers being
>>> anything other than unique per FS, so there's no issue in deserializing.  But
>>> the UIMA community may have built other things that depend on these identifiers
>>> not changing. 
>>> What's your opinion: should the XMI and JSON etc serialization in V3 be changed
>>> to reproduce (approximately) the same reference numbers as v2?  I say
>>> approximately, because other factors might affect these, such as the ordering
>>> for things not in "ordered" indexes, etc. between v2 and v3.
>>> -Marshall
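
The id-to-address mapping described in the original message above can be
illustrated with a minimal sketch. It is not the UIMA implementation: the FS
record, the build_id_to_address_map helper, the starting address of 1, and the
assumption that each FS occupies one type-code slot plus one slot per feature
are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FS:
    """Toy stand-in for a feature structure (hypothetical, not a UIMA class)."""
    fs_id: int        # v3-style id: a plain int that increments by 1
    type_name: str
    n_features: int   # number of feature slots (assumed fixed per type)


def build_id_to_address_map(fss, first_address=1):
    """Map v3-style ids to v2-like heap addresses.

    Assumption: in the simulated heap each FS takes one slot for its type
    code plus one slot per feature, so consecutive FSs are spaced by
    1 + n_features rather than by 1.
    """
    id_to_addr = {}
    next_addr = first_address
    # Walk the FSs in id order so addresses grow in creation order.
    for fs in sorted(fss, key=lambda f: f.fs_id):
        id_to_addr[fs.fs_id] = next_addr
        next_addr += 1 + fs.n_features
    return id_to_addr


if __name__ == "__main__":
    fss = [FS(1, "Token", 2), FS(2, "Sentence", 2), FS(3, "Document", 3)]
    # A serializer would write the mapped address wherever one FS
    # references another, instead of the raw v3 id.
    print(build_id_to_address_map(fss))
```

Under these assumptions the ids 1, 2, 3 map to addresses that are spaced by the
size of each FS, which is why a v3 id and a v2 address generally differ for the
same FS.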
