uima-dev mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: CAS serialization performance: XMI vs. Java serialization
Date Wed, 15 Aug 2012 10:09:43 GMT
On 15/08/12 11:09, Richard Eckart de Castilho wrote:
> On 15.08.2012 at 11:00, Thilo Goetz wrote:
>> However, as I recall, there was a way you could serialize the CAS
>> without the type system if you were sure you didn't need it.  Isn't that
>> the difference between the CasCompleteSerializer and the
>> NotSoCompleteSerializer (making that up here)?  On the way back, you can
>> deserialize into an existing CAS that has the right type system.
> I tried the CasCompleteSerializer (in contrast to the CasSerializer) because I am not
> sure what "the right type system" means. Afaik, on configuration of the type system,
> types internally get assigned numeric IDs which are then used in the heap. I wasn't
> sure whether these could change between runs, even though the type system is
> technically the same.

If you serialize many CASes from the same UIMA pipeline, you only need
to serialize the type system once.  However, you do need to have a
serialized binary version of that type system.  The assignment of codes
to types and features is not deterministic and may vary between JVMs.

>> Your times above, do they include time needed to do the compression?
>> I'm surprised binary serialization is not even twice as fast.  Or is
>> this gated by the disk I/O?
> It currently includes gzip compression and is limited by disk I/O, since that's the
> scenario I am faced with. Out of curiosity, I was planning to run the same test writing
> to a ByteArrayOutputStream to see how much time the actual encoding takes. I was also
> surprised that it wasn't faster and in particular that the file size wasn't much smaller.

The XMI compresses really well because it's mostly air ;-)  The binary
serialization is actually pretty wasteful, particularly for small CASes.
This is because all data types other than strings are encoded as
integers and always take up 32 bits.  I don't know how well compression
handles that kind of scenario.  I also don't know how strings are
serialized in the binary serialization.  Is there a conversion to UTF-8?
If not, it gets serialized as UTF-16, which also is a huge waste for
English text.  So I'm not so surprised by the file sizes.  But I would
have expected a bigger time difference.
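
To put rough numbers on the two size effects above, here is a minimal
sketch using only the JDK, not the UIMA serializers themselves.  The class
name EncodingSizes and the helpers packAsInt32 and gzip are made up for
illustration, and the sample data (small values stored in 32-bit slots) is
an assumption about what a typical heap looks like, not a measurement of
a real CAS.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; nothing here calls into UIMA.
class EncodingSizes {

    // Store every value in a fixed 32-bit slot, regardless of magnitude.
    static byte[] packAsInt32(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Gzip a byte array in memory, so disk I/O does not enter the picture.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // English (ASCII) text: one byte per char in UTF-8, two in UTF-16.
        String text = "The quick brown fox jumps over the lazy dog.";
        int utf8 = text.getBytes(StandardCharsets.UTF_8).length;
        int utf16 = text.getBytes(StandardCharsets.UTF_16BE).length;
        System.out.println("UTF-8 bytes:  " + utf8);
        System.out.println("UTF-16 bytes: " + utf16);

        // Assumed sample data: small values, so most high-order bytes are zero.
        int[] values = new int[10_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 50;
        }
        byte[] raw = packAsInt32(values);
        byte[] compressed = gzip(raw);
        System.out.println("raw int32 bytes: " + raw.length);
        System.out.println("gzipped bytes:   " + compressed.length);
    }
}
```

Running this against a dump of real heap data would say more than the
synthetic values do, but it gives an idea of how much of a fixed 32-bit
encoding gzip can win back.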


> -- Richard
