uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: CasIOUtils class - some meta-questions
Date Tue, 02 Aug 2016 20:49:30 GMT
I'll take a look now; thanks for the work!

-Marshall

On 8/2/2016 7:40 AM, Peter Klügl wrote:
> Hi,
>
>
> The errors were on my side. Reading the CASes created by the unit test
> of CasIOUtils with UIMA 2.8.1 works fine now.
>
>
> Can I do something else for this ticket?
>
>
> Best,
>
>
> Peter
>
>
> On 25.07.2016 at 08:43, Peter Klügl wrote:
>> Yeah, I know Java serialization.
>>
>> I think it depends on the perspective and the use case. I added a header
>> to the serialized outputs since I see them as binary formats and I
>> thought that all binary formats should get the same header. Then, I
>> removed it again, then I added it again. I will remove it again now.
>>
>>
>> I don't think that we will get an optimal solution, e.g., the header is
>> read twice, the previous uimaj method should return the format and so
>> on. We should get this up and running for the release without breaking
>> backwards compatibility and then think what it should look like, and if
>> further functionality/refactoring is required.
>>
>>
>> I used uimaj-core 2.8.1. Here are some errors:
>>
>> simpleCas.bins0
>> org.apache.uima.cas.CASRuntimeException: No sofaFS for specified sofaRef found.
>> simpleCas.bins4
>>     at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:806)
>>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(FSIndexRepositoryImpl.java:2781)
>>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS(FSIndexRepositoryImpl.java:2763)
>>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(FSIndexRepositoryImpl.java:2068)
>>     at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(CASImpl.java:1765)
>>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1488)
>>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
>>     at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>>     at tutorial.entity.LoadCas.main(LoadCas.java:55)
>> org.apache.uima.cas.CASRuntimeException: Error trying to read BLOB data from an input stream and deserialize into a CAS.
>>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1591)
>>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
>>     at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>>     at tutorial.entity.LoadCas.main(LoadCas.java:39)
>>
>> simpleCas.bins6
>> java.io.EOFException
>>     at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
>>     at org.apache.uima.util.impl.DataIO.readVlong(DataIO.java:355)
>>     at org.apache.uima.cas.impl.BinaryCasSerDes6.readVlong(BinaryCasSerDes6.java:2193)
>>     at org.apache.uima.cas.impl.BinaryCasSerDes6.readDiff(BinaryCasSerDes6.java:2102)
>>     at org.apache.uima.cas.impl.BinaryCasSerDes6.readLongOrDouble(BinaryCasSerDes6.java:2128)
>>     at org.apache.uima.cas.impl.BinaryCasSerDes6.readByKind(BinaryCasSerDes6.java:1920)
>>     at org.apache.uima.cas.impl.BinaryCasSerDes6.deserializeAfterVersion(BinaryCasSerDes6.java:1748)
>>     at org.apache.uima.cas.impl.BinaryCasSerDes6.deserialize(BinaryCasSerDes6.java:1596)
>>     at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:270)
>>     at tutorial.entity.LoadCas.main(LoadCas.java:47)
>>
>>
>>
>> On 22.07.2016 at 21:17, Marshall Schor wrote:
>>> I think the model for these two formats is more general than what you are
>>> imagining.  These are formats that follow the Java object serialization
>>> standard; see, for example,
>>> https://docs.oracle.com/javase/7/docs/platform/serialization/spec/serialTOC.html
>>>
>>> The bytes corresponding to the serialized form are expected to (in general) be
>>> written anywhere in a data output stream, perhaps preceded or followed by (maybe
>>> many) other serialized objects; the overall format of that stream is up to the
>>> user designing it, including any headers the user might decide on.
>>>
>>> In the data output stream, each data object, including one representing the CAS,
>>> for example, has a format dictated by the Java standard for object serialization.
>>>
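The stream layout described above can be sketched with plain java.io. This is a minimal illustration, not UIMA code: Payload is a hypothetical stand-in for a serializable CAS object, and the integer header and string trailer are an arbitrary user-chosen layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch: a Java-serialized object embedded in a stream whose
// overall layout (header, trailer) is decided by the user, not by UIMA.
final class StreamLayoutSketch {

    // Stand-in for a serializable CAS object (assumption, not a UIMA class).
    static final class Payload implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;
        Payload(String text) { this.text = text; }
    }

    // Writes: user header (int), then the serialized object, then a trailer.
    static byte[] write(int userHeader, Payload payload, String trailer) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeInt(userHeader);   // user-defined data before the object
                out.writeObject(payload);   // format dictated by Java serialization
                out.writeUTF(trailer);      // user-defined data after the object
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads the three parts back in the same order they were written.
    static String read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            int userHeader = in.readInt();
            Payload payload = (Payload) in.readObject();
            String trailer = in.readUTF();
            return userHeader + "|" + payload.text + "|" + trailer;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The object's own bytes carry no UIMA-specific header here; anything around them belongs to the application that designed the stream.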
>>> What error do you get when you try to deserialize a CAS object in a data stream
>>> with an older version of UIMA?
>>>
>>> -Marshall
>>>
>>> On 7/22/2016 9:31 AM, Peter Klügl wrote:
>>>> So SERIALIZED and SERIALIZED_TS get no header?
>>>>
>>>>
>>>> Can you try to deserialize the CAS files created by the unit test with
>>>> an older version of uima? I cannot get it to work.
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 22.07.2016 at 15:18, Marshall Schor wrote:
>>>>> Re: The java-serialized formats now have also a binary header
>>>>>
>>>>> Not sure what you mean by java-serialized formats.  Perhaps this means
>>>>> the formats created by using standard Java Object serialization on the
>>>>> special objects in UIMA built for this.
>>>>>
>>>>> If so, then it seems this would break backwards compatibility, in that a
>>>>> user serializing with UIMA 2.9.0, but not using any new features, could
>>>>> not have that "read" by an older version of UIMA.
>>>>>
>>>>>
>>>>> -Marshall
>>>>>
>>>>> On 7/22/2016 7:43 AM, Peter Klügl wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I changed CasIOUtils to use the Header and I extended the header with a
>>>>>> bit (0x08) indicating an included type system. No information about the
>>>>>> serialization of the type system yet. The java-serialized formats now
>>>>>> also have a binary header, as I did not want to make the header
>>>>>> serializable; it should be read/written by the same functionality.
>>>>>>
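The flag-bit idea can be sketched as follows. Only the value 0x08 comes from the message above; the class name, method names, and the assumption that the flags live in a single int are hypothetical and do not describe the actual CommonSerDes header layout.

```java
// Hypothetical sketch of a header flag bit; not the CommonSerDes layout.
final class HeaderFlagSketch {

    // Bit value mentioned in the thread for "type system included".
    static final int TYPE_SYSTEM_INCLUDED = 0x08;

    // Returns the flags with the type-system bit switched on.
    static int withTypeSystem(int flags) {
        return flags | TYPE_SYSTEM_INCLUDED;
    }

    // True if the type-system bit is set, regardless of other bits.
    static boolean hasTypeSystem(int flags) {
        return (flags & TYPE_SYSTEM_INCLUDED) != 0;
    }
}
```

A reader that does not know the bit would simply ignore it, which is why an extra bit alone does not tell older versions how to skip the additional type system bytes.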
>>>>>> I have thought that old UIMA versions (e.g., 2.8.1) should be able to
>>>>>> load new CAS files, but my tests failed.  No idea yet why. I am overall
>>>>>> not very happy with the current solution, but I could live with it.
>>>>>>
>>>>>> Maybe someone wants to take a look at it?
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 20.07.2016 at 14:30, Peter Klügl wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> I'll try to find the time to do these changes this week, next week
>>>>>>> latest.
>>>>>>>
>>>>>>>
>>>>>>> btw, input stream sniffing in order to distinguish XMI and XCAS is
>>>>>>> currently not supported. There could be a lot of text before the
>>>>>>> relevant element occurs, e.g., license text.
>>>>>>>
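To make the sniffing problem concrete, here is a minimal sketch that peeks at a bounded prefix of a stream. It is not CasIOUtils code. The marker strings "<xmi:XMI" and "<CAS" are assumptions about the usual root elements of the two formats, and the method name and the limit parameter are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: guess XMI vs. XCAS from the first 'limit' bytes.
final class XmlFormatSniffSketch {

    static String sniff(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(limit);                       // remember the start position
            byte[] buf = new byte[limit];
            int n = 0;
            while (n < limit) {                   // read up to 'limit' bytes
                int r = in.read(buf, n, limit - n);
                if (r < 0) break;
                n += r;
            }
            in.reset();                           // leave the stream untouched
            String head = new String(buf, 0, n, StandardCharsets.UTF_8);
            int xmi = head.indexOf("<xmi:XMI");   // assumed XMI root element
            int xcas = head.indexOf("<CAS");      // assumed XCAS root element
            if (xmi >= 0 && (xcas < 0 || xmi < xcas)) return "XMI";
            if (xcas >= 0) return "XCAS";
            return "UNKNOWN";                     // marker not within the prefix
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A long leading comment such as license text pushes the root element past any fixed prefix, so a sketch like this returns UNKNOWN in exactly the case raised above.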
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>> On 20.07.2016 at 14:19, Marshall Schor wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We can change the header, but:
>>>>>>>>
>>>>>>>> The changed header ought to be "readable" by previous versions of UIMA.
>>>>>>>>
>>>>>>>> For XMI and XCAS, these do not currently have special headers, and if we
>>>>>>>> added these, those formats could not be read by older versions of UIMA.
>>>>>>>> Those formats contain sufficient distinguishing initial strings to
>>>>>>>> distinguish them, though.
>>>>>>>>
>>>>>>>> The XMI format is specified, also, in an OASIS standard which the UIMA
>>>>>>>> project is said to (mostly) follow:
>>>>>>>> http://uima.apache.org/uima-specification.html
>>>>>>>>
>>>>>>>> For binary serializations, I think there's room in the header for an
>>>>>>>> extra bit, which if on, could indicate that a type system was included.
>>>>>>>> I think it would be good to have a header extension, when type systems
>>>>>>>> are included, to specify the format and version of the type system
>>>>>>>> serialization.
>>>>>>>>
>>>>>>>> Most serializations in core UIMA have not included the type system.  The
>>>>>>>> one which does is CASCompleteSerializer.  This is a "serializable" (using
>>>>>>>> standard Java serialization) object containing serializable forms of the
>>>>>>>> CAS and Type System.
>>>>>>>>
>>>>>>>> Regarding making methods in CommonSerDes public:
>>>>>>>>
>>>>>>>> It is fine to make them public in the sense that they are accessible from
>>>>>>>> other packages, not in a sub-type hierarchy.  But I think it is best to
>>>>>>>> not include CommonSerDes in a package which is intended for end-users,
>>>>>>>> because the end user UIMA APIs should be (as much as possible) stable
>>>>>>>> over a long time period.  Details of how we evolve headers, etc., should
>>>>>>>> not disturb end users, if possible; keeping these as public but in
>>>>>>>> packages with names like xxx.impl or xyz.internal.abc etc. is the way
>>>>>>>> this has been traditionally done.  It allows us to evolve these without
>>>>>>>> affecting end-user APIs.
>>>>>>>>
>>>>>>>> Just to be clear: I would not consider uimaFIT and Ruta to be
>>>>>>>> "end-users", as they are developed within the UIMA project, and we are
>>>>>>>> willing to evolve them together with UIMA core changes.
>>>>>>>>
>>>>>>>> We don't have a deadline for the next release, but it's mostly ready to
>>>>>>>> go, and will solve a significant issue for people wanting to upgrade
>>>>>>>> their Eclipse to Neon :-).
>>>>>>>>
>>>>>>>> -Marshall
>>>>>>>>
>>>>>>>> On 7/20/2016 5:03 AM, Peter Klügl wrote:
>>>>>>>>> Ok, after looking at the code I must admit that there is much more to
>>>>>>>>> do than I expected. We first need to discuss several things:
>>>>>>>>>
>>>>>>>>> - can we change the header at all?
>>>>>>>>>
>>>>>>>>> - do we support type system inclusion in the header?
>>>>>>>>>
>>>>>>>>> - do we support type system inclusion in the serialized files?
>>>>>>>>>
>>>>>>>>> - which serial formats are which ones?
>>>>>>>>>
>>>>>>>>> - can we make the methods in CommonSerDes public?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What is the deadline for the release? I am now quite loaded with work
>>>>>>>>> until next Wednesday :-(
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 19.07.2016 at 22:39, Marshall Schor wrote:
>>>>>>>>>> Great.
>>>>>>>>>>
>>>>>>>>>> There's now also common code for writing / reading UIMA serialization
>>>>>>>>>> headers, in
>>>>>>>>>>
>>>>>>>>>> CommonSerDes (in org.apache.uima.cas.impl )
>>>>>>>>>>
>>>>>>>>>> This includes the extensions to support versioning the serializations,
>>>>>>>>>> which start to be needed in the next release because a bug fix is
>>>>>>>>>> slightly changing the serialized form for **delta binary** CAS.
>>>>>>>>>>
>>>>>>>>>> So, it would be good to use that rather than have another separate
>>>>>>>>>> header reader/writer to maintain.
>>>>>>>>>>
>>>>>>>>>> -Marshall
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/19/2016 4:13 PM, Peter Klügl wrote:
>>>>>>>>>>> Ah, I didn't know that enum. I'll adapt the code and enum.
>>>>>>>>>>>
>>>>>>>>>>> On 19.07.2016 at 20:09, Marshall Schor wrote:
>>>>>>>>>>>> We already have an enum in the core for various serial formats.  The
>>>>>>>>>>>> class is
>>>>>>>>>>>>
>>>>>>>>>>>> public enum SerialFormat {
>>>>>>>>>>>>    UNKNOWN,
>>>>>>>>>>>>    XCAS,         // with reachability filtering
>>>>>>>>>>>>    XMI,          // with reachability filtering
>>>>>>>>>>>>    BINARY,       // no filtering
>>>>>>>>>>>>    COMPRESSED,   // no filtering  (form 4)
>>>>>>>>>>>>    COMPRESSED_FILTERED,   // with reachability and type and feature filtering (form 6)
>>>>>>>>>>>>    COMPRESSED_PROJECTION, // with subset of views
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> (I don't think COMPRESSED_PROJECTION is in use...)
>>>>>>>>>>>>
>>>>>>>>>>>> This has been around for maybe 3 years.  I would be in favor of
>>>>>>>>>>>> considering using and/or extending this as needed, rather than having
>>>>>>>>>>>> two formats (that is, the proposed SerializationFormat class).
>>>>>>>>>>>>
>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/19/2016 2:49 AM, Peter Klügl wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> yes, the class should be officially available to external code. I
>>>>>>>>>>>>> already included it in the CAS Editor and in Ruta. I also plan to use
>>>>>>>>>>>>> it in our inhouse code. I'll change the enforcer rule.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can write the docs but any help is welcome since I do not know how
>>>>>>>>>>>>> much spare time I have for the rest of the week for this. I'll take a
>>>>>>>>>>>>> look where the documentation should be added. Haven't looked at it
>>>>>>>>>>>>> for some time ;-)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just chose the name of the class Richard contributed since I
>>>>>>>>>>>>> thought it is really suitable. Then, I also noticed the uimaFIT
>>>>>>>>>>>>> class. This is not a really good situation, but I would not change
>>>>>>>>>>>>> the name because of it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would not split the API from the implementation. I do not see any
>>>>>>>>>>>>> advantages right now. The class is just a simple utils class with
>>>>>>>>>>>>> only static methods like CasCreationUtils (which is also not
>>>>>>>>>>>>> separated).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 18.07.2016 at 22:26, Marshall Schor wrote:
>>>>>>>>>>>>>> This is OK with me.  I can even volunteer to write the docs (but am
>>>>>>>>>>>>>> happy to let others do it :-) ).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll wait to hear about the split (if any) between the public API
>>>>>>>>>>>>>> and the impl.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And, we'll need to change the next version # to 2.9.0, from 2.8.2,
>>>>>>>>>>>>>> due to this being that kind of a change.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is everyone OK with all of this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/18/2016 2:39 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>>>> I believe the intention is that this class becomes part of the
>>>>>>>>>>>>>>> public API.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, my understanding is that it would do a superset of what the
>>>>>>>>>>>>>>> uimaFIT class by the same name does. We could then probably
>>>>>>>>>>>>>>> deprecate the respective uimaFIT class and suggest using the core
>>>>>>>>>>>>>>> class instead.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Richard
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 18.07.2016, at 20:30, Marshall Schor <msa@schor.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is a new class added to uimaj-core project, in
>>>>>>>>>>>>>>>> org.apache.uima.util package.  This is fine if this is to be part
>>>>>>>>>>>>>>>> of the official public APIs supported by UIMA going forward; but
>>>>>>>>>>>>>>>> if that is the case, it should probably be documented in the UIMA
>>>>>>>>>>>>>>>> docs, and we'd have to change the version number (due to enforcer
>>>>>>>>>>>>>>>> rules).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If this is more of an internal-use utility, then it should be in
>>>>>>>>>>>>>>>> one of the internal use packages, such as
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    org.apache.uima.internal.util
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This class is similarly named to a UIMAFit class; are these
>>>>>>>>>>>>>>>> related?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If some of the APIs are to be permanent and public and part of the
>>>>>>>>>>>>>>>> official public APIs, but some are internal implementation
>>>>>>>>>>>>>>>> details, please consider using an interface and an ".impl" (or
>>>>>>>>>>>>>>>> equivalent) approach; packages which support these are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    org.apache.uima.util  and
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    org.apache.uima.util.impl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If this is only an internal kind of change, not intending to
>>>>>>>>>>>>>>>> affect the official UIMA APIs, then moving to the internal.util
>>>>>>>>>>>>>>>> package will fix the "enforcer" error the build is currently
>>>>>>>>>>>>>>>> getting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>>>>>
>

