uima-dev mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: opinion on degree of backwards compatibility for Uima V3 experiment
Date Fri, 02 Sep 2016 09:36:34 GMT
Same here.

It looks like we are now also starting to use the address, and I am
also thinking of using it more in Ruta (internal indexing).

Btw, I did some simple experiments lately concerning the stability of
the addresses when using CasIOUtils. Can it happen that the addresses
change if you just deserialize the same CAS twice without serializing it
in between?



On 01.09.2016 at 19:29, Richard Eckart de Castilho wrote:
> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e. out-of-type-system)
> unique identifiers for feature structures facilitates handling them, e.g. in editors. We
> use that quite a bit in WebAnno.
> In WebAnno, we do not rely on any heap arithmetic - an ID is just expected to be a unique
> identifier. However, I could imagine cases where people might rely on the ID to increment
> monotonically for new FSes.
> Most binary formats do not preserve the ID across a save/load cycle. However, SERIALIZED
> and SERIALIZED_TSI *do* preserve the ID, and WebAnno makes use of that. It allows us to keep
> references to FSes without having to keep the CAS in memory all the time.
> There should continue to be a V3 serialization format which preserves IDs across a load/save
> I presently do not see a case where a strong similarity between V2 and V3 IDs would be
> important. It would be nice if deserializing a V2 SERIALIZED or SERIALIZED_TSI into V3 would
> restore the V2 IDs - I expect it to be an easy thing to do.
> Cheers,
> -- Richard
>> On 01.09.2016, at 16:09, Marshall Schor <msa@schor.com> wrote:
>> UIMA V3 implementation includes in many places extra code (takes time / space)
>> whose goal is to make things look closer to version 2.  Some of this is for
>> interoperability with version 2 artifacts, like serialized forms.
>> An example: in v2, many serialization forms include "references" to other
>> Feature Structures (FSs), and for those, the encoding is the "address" in the
>> heap of the FS.
>> In v3, there is no heap, but the FSs have "ids", which are (at the moment) an
>> int which increments by 1.  This does not match the "address" in v2, so many parts
>> of the serialization code build a map at serialization time from the v3 ids to
>> v2 "addresses", and use the latter in the serialization form.
>> Currently, this is done for various binary serializations, so that these can be
>> read back in by v2 code.
>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't checked).  So
>> the serialized forms for these differ between v2 and v3, in that the numbers
>> used to represent references to other FSs are different.
>> The deserialization code for XMI and JSON doesn't depend on these numbers being
>> anything other than unique per FS, so there's no issue in deserializing.  But
>> the UIMA community may have built other things that depend on these identifiers
>> not changing. 
>> What's your opinion: should the XMI, JSON, etc. serializations in V3 be changed
>> to reproduce (approximately) the same reference numbers as v2?  I say
>> approximately, because other factors might affect these, such as the ordering
>> for things not in "ordered" indexes, etc. between v2 and v3.
>> -Marshall
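
The id-to-address map Marshall describes (v3 ids that increment by 1, translated at serialization time into v2-style heap "addresses") can be illustrated with a small, self-contained sketch. This is not UIMA code: the `FS` class, the `build_id_to_address` function, and the layout rule (each FS occupying one heap cell plus one cell per feature, with address 0 reserved) are simplifying assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FS:
    """Hypothetical stand-in for a feature structure (not a UIMA class)."""
    fs_id: int       # v3-style id: a plain int that increments by 1
    n_features: int  # number of feature slots (assumed layout input)


def build_id_to_address(fss, first_address=1):
    """Map each v3-style id to a simulated v2-style heap address.

    Assumption: every FS takes 1 cell plus one cell per feature, and
    FSs are laid out in id order starting at `first_address`.
    """
    id_to_addr = {}
    next_addr = first_address
    for fs in sorted(fss, key=lambda f: f.fs_id):
        id_to_addr[fs.fs_id] = next_addr
        next_addr += 1 + fs.n_features  # advance past this FS's cells
    return id_to_addr


fss = [FS(1, 2), FS(2, 0), FS(3, 4)]
mapping = build_id_to_address(fss)
for fs_id, addr in mapping.items():
    print(f"id {fs_id} -> address {addr}")
```

Under these assumptions the ids stay dense (1, 2, 3) while the simulated addresses depend on the size of every earlier FS, which is why a reference serialized by id differs from one serialized by address.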
