uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: CasIOUtils class - some meta-questions
Date Thu, 04 Aug 2016 15:08:17 GMT
I'm taking a try at the general documentation for this class; here's what I have
(written from the point of view of being useful to new users of this class).

CasIOUtils is a collection of static methods aimed at making it easy to
  - save and load CASes, and to
  - optionally include their Type Systems and index definitions based on those
    type systems (abbreviated TSI).

Several serialization formats are supported; these are listed in the Java
enum SerialFormat, together with their preferred file extensions.

The APIs for loading attempt to automatically use the appropriate deserializer,
based on the input data format.  To select the right deserializer, the file
extension (if available) is checked first:
  - xmi: XMI format
  - xcas: XCAS format
  - xml: XCAS format

If none of these apply, then the first few bytes of the input are examined to
determine the format.
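
A minimal sketch of that two-step lookup (extension first, then a peek at the
leading bytes), using only the JDK.  This is not the CasIOUtils implementation:
the class FormatSniffer, the enum Format, and the 10,000-byte peek limit are
hypothetical names and values chosen for illustration, and the sniffing step
only distinguishes an XMI root element (xmi:XMI) from an XCAS root element
(CAS).  It does not try to skip "<" characters that occur inside XML comments.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class FormatSniffer {
    /** Hypothetical stand-in for the real SerialFormat enum. */
    enum Format { XMI, XCAS, OTHER }

    /** Assumed upper bound on how many bytes are examined. */
    static final int PEEK_LIMIT = 10_000;

    /** Step 1: map a file extension to a format; null if it is not recognized. */
    static Format fromExtension(String name) {
        int dot = name.lastIndexOf('.');
        if (dot < 0) return null;
        switch (name.substring(dot + 1).toLowerCase(Locale.ROOT)) {
            case "xmi":  return Format.XMI;
            case "xcas":
            case "xml":  return Format.XCAS;
            default:     return null;
        }
    }

    /** Returns the name of the first XML element, skipping "<?...?>" and "<!...>". */
    static String firstElementName(String head) {
        int i = 0;
        while ((i = head.indexOf('<', i)) >= 0) {
            if (i + 1 >= head.length()) return null;
            char c = head.charAt(i + 1);
            if (c == '?' || c == '!') { i++; continue; }
            int end = i + 1;
            while (end < head.length()
                    && !Character.isWhitespace(head.charAt(end))
                    && head.charAt(end) != '>' && head.charAt(end) != '/') {
                end++;
            }
            return head.substring(i + 1, end);
        }
        return null;
    }

    /** Step 2: examine the leading bytes, then rewind so a deserializer can read them again. */
    static Format sniff(BufferedInputStream in) throws IOException {
        in.mark(PEEK_LIMIT);
        byte[] buf = new byte[PEEK_LIMIT];
        int n = 0, r;
        while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0) n += r;
        in.reset();
        String root = firstElementName(new String(buf, 0, n, StandardCharsets.UTF_8));
        if ("CAS".equals(root)) return Format.XCAS;
        if ("xmi:XMI".equals(root)) return Format.XMI;
        return Format.OTHER;
    }

    /** Extension wins when it is recognized; otherwise fall back to sniffing. */
    static Format detect(String name, BufferedInputStream in) throws IOException {
        Format byExt = (name == null) ? null : fromExtension(name);
        return (byExt != null) ? byExt : sniff(in);
    }
}
```

Using mark/reset keeps the peek non-destructive, so the same stream can be
handed to whichever deserializer is selected.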

For loading, the inputs may be supplied as URLs or as InputStreams.  You can use
Files or Paths by converting these to URLs:
   URL url = a_path.toUri().toURL();    // java.nio.file.Path
   URL url = a_file.toURI().toURL();    // java.io.File (note: toURI, not toUri)
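
As a self-contained illustration of the two conversions, using only the JDK
(the class name ToUrlExample and the sample file name are made up for this
sketch; the file does not need to exist):

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ToUrlExample {
    /** Path -> URI -> URL; Path spells the method toUri(). */
    static URL fromPath(Path p) throws MalformedURLException {
        return p.toUri().toURL();
    }

    /** File -> URI -> URL; File spells the method toURI(). */
    static URL fromFile(File f) throws MalformedURLException {
        return f.toURI().toURL();
    }

    public static void main(String[] args) throws Exception {
        Path aPath = Paths.get("data", "sample.xmi");
        File aFile = new File("data/sample.xmi");
        // Both print a file: URL resolved against the current working directory.
        System.out.println(fromPath(aPath));
        System.out.println(fromFile(aFile));
    }
}
```

Going through a URI first, rather than the deprecated File.toURL(), makes sure
characters such as spaces in the path are escaped properly.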

When loading, an optional lenient boolean flag may be specified; if true, then
types and/or features being deserialized which don't exist in the receiving CAS
are silently ignored.

When TSI is saved, it is either saved in the same destination (e.g. file or
stream), or in a separate one. 
  - Two serialization formats support saving the TSI in the same destination: 
    -- SERIALIZED_TS and
Other formats require the TSI to be saved to a separate OutputStream.

Summary of the APIs for saving:

  save(CAS, OutputStream, SerialFormat)
  save(CAS, OutputStream, OutputStream, SerialFormat)
      - the extra OutputStream is for saving the TSI

Summary of APIs for loading:
  load(URL        , CAS)
  load(InputStream, CAS)

  load(URL        , URL        , CAS, lenient_flag)
  load(InputStream, InputStream, CAS, lenient_flag)
      - the second URL or InputStream is for loading a separately-stored TSI

You may specify the lenient_flag without the TSI input by setting the 2nd
argument to null.

To make this documentation correct, the impl needs some slight adjustments:

The method that reads the first few bytes of the input to determine the format
should look for XCAS format explicitly (e.g., load the first 10,000 bytes and
search for <CAS> as the first XML element?) and maybe handle it.

Make the load with a non-null TSI input work for all formats (currently the TSI
input is silently ignored for xmi and xcas).
