uima-dev mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: CasIOUtils class - some meta-questions
Date Tue, 02 Aug 2016 11:40:14 GMT
Hi,


The errors were on my side. Reading the CASes created by the unit test
of CasIOUtils with UIMA 2.8.1 works fine now.


Can I do something else for this ticket?


Best,


Peter


On 25.07.2016 at 08:43, Peter Klügl wrote:
> Yeah, I know java serialization.
>
> I think it depends on the perspective and the use case. I added a header
> to the serialized outputs since I see them as binary formats and I
> thought that all binary formats should get the same header. Then, I
> removed it again, then I added it again. I will remove it again now.
>
>
> I don't think that we will get an optimal solution, e.g., the header is
> read twice, the previous uimaj method should return the format and so
> on. We should get this up and running for the release without breaking
> backwards compatibility and then think what it should look like, and if
> further functionality/refactoring is required.
>
>
> I used uimaj-core 2.8.1. Here are some errors:
>
> simpleCas.bins0
> org.apache.uima.cas.CASRuntimeException: No sofaFS for specified sofaRef
> found.simpleCas.bins4
>     at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:806)
>     at
> org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(FSIndexRepositoryImpl.java:2781)
>     at
> org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS(FSIndexRepositoryImpl.java:2763)
>     at
> org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(FSIndexRepositoryImpl.java:2068)
>     at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(CASImpl.java:1765)
>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1488)
>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
>     at
> org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>     at tutorial.entity.LoadCas.main(LoadCas.java:55)
> org.apache.uima.cas.CASRuntimeException: Error trying to read BLOB data
> from an input stream and deserialize into a CAS.
>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1591)
>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
>     at
> org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>     at tutorial.entity.LoadCas.main(LoadCas.java:39)
>
> simpleCas.bins6
> java.io.EOFException
>     at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
>     at org.apache.uima.util.impl.DataIO.readVlong(DataIO.java:355)
>     at
> org.apache.uima.cas.impl.BinaryCasSerDes6.readVlong(BinaryCasSerDes6.java:2193)
>     at
> org.apache.uima.cas.impl.BinaryCasSerDes6.readDiff(BinaryCasSerDes6.java:2102)
>     at
> org.apache.uima.cas.impl.BinaryCasSerDes6.readLongOrDouble(BinaryCasSerDes6.java:2128)
>     at
> org.apache.uima.cas.impl.BinaryCasSerDes6.readByKind(BinaryCasSerDes6.java:1920)
>     at
> org.apache.uima.cas.impl.BinaryCasSerDes6.deserializeAfterVersion(BinaryCasSerDes6.java:1748)
>     at
> org.apache.uima.cas.impl.BinaryCasSerDes6.deserialize(BinaryCasSerDes6.java:1596)
>     at
> org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:270)
>     at tutorial.entity.LoadCas.main(LoadCas.java:47)
>
>
>
> On 22.07.2016 at 21:17, Marshall Schor wrote:
>> I think the model for these two formats is more general than what you are
>> imagining.  These are formats that follow the standard Java serialization
>> standard, see for example,
>> https://docs.oracle.com/javase/7/docs/platform/serialization/spec/serialTOC.html
>>
>> The bytes corresponding to the serialized form are expected to (in general) be
>> written anywhere in a data output stream, perhaps preceded or followed by (maybe
>> many) other serialized objects; the overall format of that stream is up to the
>> user designing it, including any headers the user might decide on.
>>
>> In the data output stream, each data object, including one representing the CAS,
>> for example, has a format dictated by the Java standard for object serialization.
>>
>> What error do you get when you try to deserialize a CAS object in a data stream
>> with an older version of UIMA?
>>
>> -Marshall
>>
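
A minimal sketch of the stream layout described above: a caller-chosen header
followed by a Java-serialized object in a single stream. The class name, the
int header, and the String standing in for a serializable CAS object are all
assumptions made for illustration; none of this is UIMA code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of uimaj-core.
final class StreamLayoutSketch {

  // Writes a user-defined int header, then one Java-serialized object.
  static byte[] write(int userHeader, Serializable payload) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeInt(userHeader);   // layout chosen by the stream's designer
        out.writeObject(payload);   // format dictated by Java serialization
      }
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Reads the header first; a reader that does not expect it cannot skip it.
  static Object readPayload(byte[] data, int expectedHeader) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      if (in.readInt() != expectedHeader) {
        throw new IllegalStateException("unexpected header");
      }
      return in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

The reader has to know about anything written before the object, which is why
adding a header to an existing format is a compatibility question.
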
>> On 7/22/2016 9:31 AM, Peter Klügl wrote:
>>> So SERIALIZED and SERIALIZED_TS get no header?
>>>
>>>
>>> Can you try to deserialize the CAS files created by the unit test with
>>> an older version of uima? I cannot get it to work.
>>>
>>>
>>> Best,
>>>
>>>
>>> Peter
>>>
>>>
>>> On 22.07.2016 at 15:18, Marshall Schor wrote:
>>>> Re: The java-serialized formats now have also a binary header
>>>>
>>>> Not sure what you mean by java-serialized formats.  Perhaps this means the
>>>> formats created by using standard Java Object serialization on the special
>>>> objects in UIMA built for this.
>>>>
>>>> If so, then it seems this would break backwards compatibility, in that a user
>>>> serializing with UIMA 2.9.0, but not using any new features, could not have that
>>>> "read" by an older version of UIMA.
>>>>
>>>>
>>>> -Marshall
>>>>
>>>> On 7/22/2016 7:43 AM, Peter Klügl wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> I changed CasIOUtils to use the Header and I extended the header with a
>>>>> bit (0x08) indicating an included type system. No information about the
>>>>> serialization of the type system yet. The java-serialized formats now
>>>>> also have a binary header as I did not want to make the header
>>>>> serializable as it should be read/written by the same functionality.
>>>>>
>>>>> I had thought that old UIMA versions (e.g., 2.8.1) should be able to
>>>>> load new CAS files, but my tests failed.  No idea yet why. I am overall
>>>>> not very happy with the current solution, but I could live with it.
>>>>>
>>>>> Maybe someone wants to take a look at it?
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
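
A minimal sketch of the flag-bit idea from the message above. Only the value
0x08 comes from the thread; the class name, the method names, and the 4-byte
int layout are assumptions for illustration and do not reflect the actual
CommonSerDes header.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the CommonSerDes API or its real layout.
final class HeaderFlagSketch {

  // Bit named in the thread: set when a type system is included.
  static final int TYPE_SYSTEM_INCLUDED = 0x08;

  // Returns the flags with the type-system bit switched on.
  static int withTypeSystem(int flags) {
    return flags | TYPE_SYSTEM_INCLUDED;
  }

  // Encodes the flags as a 4-byte big-endian header (assumed layout).
  static byte[] writeHeader(int flags) {
    return ByteBuffer.allocate(4).putInt(flags).array();
  }

  // Decodes the header and tests the type-system bit.
  static boolean typeSystemIncluded(byte[] header) {
    return (ByteBuffer.wrap(header).getInt() & TYPE_SYSTEM_INCLUDED) != 0;
  }
}
```

A reader that ignores unknown bits would still accept such a header, which is
the property a backwards-compatible extension would need.
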
>>>>> On 20.07.2016 at 14:30, Peter Klügl wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I'll try to find the time to do these changes this week, next week latest.
>>>>>>
>>>>>>
>>>>>> btw, input stream sniffing in order to distinguish XMI and XCAS is
>>>>>> currently not supported. There could be a lot of text before the
>>>>>> relevant element occurs, e.g., license text.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
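
A minimal sketch of why bounded sniffing is fragile for XMI versus XCAS. The
marker strings "<xmi:XMI" and "<CAS", the class name, and the prefix limit are
assumptions for illustration; this is not how CasIOUtils detects formats.

```java
// Hypothetical illustration only; not part of uimaj-core.
final class XmlFormatSniffSketch {

  enum Guess { XMI, XCAS, UNKNOWN }

  // Looks for an assumed root-element marker within the first `limit` characters.
  static Guess sniff(String text, int limit) {
    String head = text.substring(0, Math.min(limit, text.length()));
    int xmi = head.indexOf("<xmi:XMI");
    int xcas = head.indexOf("<CAS");
    if (xmi < 0 && xcas < 0) {
      return Guess.UNKNOWN;   // marker absent, or pushed past the limit
    }
    if (xcas < 0 || (xmi >= 0 && xmi < xcas)) {
      return Guess.XMI;       // XMI marker is the only one, or comes first
    }
    return Guess.XCAS;
  }
}
```

A long leading comment, such as license text, pushes the root element past any
fixed limit, so the result falls back to UNKNOWN.
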
>>>>>> On 20.07.2016 at 14:19, Marshall Schor wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We can change the header, but:
>>>>>>>
>>>>>>> The changed header ought to be "readable" by previous versions of UIMA.
>>>>>>>
>>>>>>> For XMI and XCAS, these do not currently have special headers, and if we added
>>>>>>> these, those formats could not be read by older versions of UIMA.  Those formats
>>>>>>> contain sufficient distinguishing initial strings to distinguish them, though.
>>>>>>>
>>>>>>> The XMI format is specified, also, in an OASIS standard which the UIMA project
>>>>>>> is said to (mostly) follow: http://uima.apache.org/uima-specification.html
>>>>>>>
>>>>>>> For binary serializations, I think there's room in the header for an extra bit,
>>>>>>> which if on, could indicate that a type system was included.  I think it would
>>>>>>> be good to have a header extension, when type systems are included, to specify
>>>>>>> the format and version of the type system serialization.
>>>>>>>
>>>>>>> Most serializations in core UIMA have not included the type system.  The one
>>>>>>> which does is CASCompleteSerializer.  This is a "serializable" (using standard
>>>>>>> Java serializations) object containing serializable forms of the CAS and Type
>>>>>>> System.
>>>>>>>
>>>>>>> Regarding making methods in CommonSerDes public:
>>>>>>>
>>>>>>> It is fine to make them public in the sense that they are accessible from other
>>>>>>> packages, not in a sub-type hierarchy.  But I think it is best to not include
>>>>>>> CommonSerDes in a package which is intended for end-users, because the end user
>>>>>>> UIMA APIs should be (as much as possible) stable over a long time period.
>>>>>>> Details of how we evolve headers, etc., should not disturb end users, if
>>>>>>> possible; keeping these as public but in packages with names like xxx.impl or
>>>>>>> xyz.internal.abc etc. is the way this has been traditionally done.  It allows us
>>>>>>> to evolve these without affecting end-user APIs.
>>>>>>>
>>>>>>> Just to be clear: I would not consider uimaFIT and Ruta to be "end-users", as
>>>>>>> they are developed within the UIMA project, and we are willing to evolve them
>>>>>>> together with UIMA core changes.
>>>>>>>
>>>>>>> We don't have a deadline for the next release, but it's mostly ready to go, and
>>>>>>> will solve a significant issue for people wanting to upgrade their Eclipse to
>>>>>>> Neon :-).
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 7/20/2016 5:03 AM, Peter Klügl wrote:
>>>>>>>> Ok, after looking at the code I must admit that there is much more to do
>>>>>>>> than I expected. We first need to discuss several things:
>>>>>>>>
>>>>>>>> - can we change the header at all?
>>>>>>>>
>>>>>>>> - do we support type system inclusion in the header?
>>>>>>>>
>>>>>>>> - do we support type system inclusion in the serialized files?
>>>>>>>>
>>>>>>>> - which serial formats are which?
>>>>>>>>
>>>>>>>> - can we make the methods in CommonSerDes public?
>>>>>>>>
>>>>>>>>
>>>>>>>> What is the deadline for the release? I am now quite loaded with work
>>>>>>>> until next Wednesday :-(
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>> On 19.07.2016 at 22:39, Marshall Schor wrote:
>>>>>>>>> Great.
>>>>>>>>>
>>>>>>>>> There's now also common code for writing / reading UIMA serialization headers, in
>>>>>>>>>
>>>>>>>>> CommonSerDes (in org.apache.uima.cas.impl )
>>>>>>>>>
>>>>>>>>> This includes the extensions to support versioning the serializations, which
>>>>>>>>> start to be needed in the next release because a bug fix is slightly changing
>>>>>>>>> the serialized form for **delta binary** CAS.
>>>>>>>>>
>>>>>>>>> So, it would be good to use that rather than have another separate header
>>>>>>>>> reader/writer to maintain.
>>>>>>>>>
>>>>>>>>> -Marshall
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7/19/2016 4:13 PM, Peter Klügl wrote:
>>>>>>>>>> Ah, I didn't know that enum. I'll adapt the code and enum.
>>>>>>>>>>
>>>>>>>>>> On 19.07.2016 at 20:09, Marshall Schor wrote:
>>>>>>>>>>> We already have an enum in the core for various serial formats.  The class is
>>>>>>>>>>>
>>>>>>>>>>> public enum SerialFormat {
>>>>>>>>>>>    UNKNOWN,
>>>>>>>>>>>    XCAS,         // with reachability filtering
>>>>>>>>>>>    XMI,          // with reachability filtering
>>>>>>>>>>>    BINARY,       // no filtering
>>>>>>>>>>>    COMPRESSED,   // no filtering  (form 4)
>>>>>>>>>>>    COMPRESSED_FILTERED,   // with reachability and type and feature filtering (form 6)
>>>>>>>>>>>    COMPRESSED_PROJECTION, // with subset of views
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> (I don't think COMPRESSED_PROJECTION is in use...)
>>>>>>>>>>>
>>>>>>>>>>> This has been around for maybe 3 years.  I would be in favor of considering
>>>>>>>>>>> using and/or extending this as needed, rather than having two formats (that is,
>>>>>>>>>>> the proposed SerializationFormat class).
>>>>>>>>>>>
>>>>>>>>>>> -Marshall
>>>>>>>>>>>
>>>>>>>>>>> On 7/19/2016 2:49 AM, Peter Klügl wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> yes, the class should be officially available to external code. I
>>>>>>>>>>>> already included it in the CAS Editor and in Ruta. I also plan to use it
>>>>>>>>>>>> in our inhouse code. I'll change the enforcer rule.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I can write the docs but any help is welcome since I do not know how
>>>>>>>>>>>> much spare time I have for the rest of the week for this. I'll take a
>>>>>>>>>>>> look where the documentation should be added. Haven't looked at it for
>>>>>>>>>>>> some time ;-)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I just chose the name of the class Richard contributed since I thought
>>>>>>>>>>>> it is really suitable. Then, I also noticed the uimaFIT class. This is
>>>>>>>>>>>> not really a good situation, but I would not change the name because of it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I would not split the API from the implementation. I do not see any
>>>>>>>>>>>> advantages right now. The class is just a simple utils class with only
>>>>>>>>>>>> static methods like CasCreationUtils (which is also not separated).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On 18.07.2016 at 22:26, Marshall Schor wrote:
>>>>>>>>>>>>> This is OK with me.  I can even volunteer to write the docs (but am happy to
>>>>>>>>>>>>> let others do it :-) ).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll wait to hear about the split (if any) between the public API and the
>>>>>>>>>>>>> impl.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And, we'll need to change the next version # to 2.9.0, from 2.8.2, due to this
>>>>>>>>>>>>> being that kind of a change.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is everyone OK with all of this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/18/2016 2:39 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>>> I believe the intention is that this class becomes part of the public API.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, my understanding is that it would do a superset of what the
>>>>>>>>>>>>>> uimaFIT class by the same name does. We could then probably deprecate
>>>>>>>>>>>>>> the respective uimaFIT class and suggest using the core class instead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Richard
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 18.07.2016, at 20:30, Marshall Schor <msa@schor.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is a new class added to uimaj-core project, in org.apache.uima.util
>>>>>>>>>>>>>>> package.  This is fine if this is to be part of the official public APIs
>>>>>>>>>>>>>>> supported by UIMA going forward; but if that is the case, it should
>>>>>>>>>>>>>>> probably be documented in the UIMA docs, and we'd have to change the
>>>>>>>>>>>>>>> version number (due to enforcer rules).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If this is more of an internal-use utility, then it should be in one of
>>>>>>>>>>>>>>> the internal use packages, such as
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    org.apache.uima.internal.util
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This class is similarly named to a UIMAFit class; are these related?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If some of the APIs are to be permanent and public and part of the official
>>>>>>>>>>>>>>> public APIs, but some are internal implementation details, please
>>>>>>>>>>>>>>> consider using an interface and an ".impl" (or equivalent) approach;
>>>>>>>>>>>>>>> packages which support these are:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    org.apache.uima.util  and
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    org.apache.uima.util.impl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If this is only an internal kind of change, not intending to affect the
>>>>>>>>>>>>>>> official UIMA APIs, then moving to the internal.util package will fix the
>>>>>>>>>>>>>>> "enforcer" error the build is currently getting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>>>>

