uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: CAS serialization performance: XMI vs. Java serialization
Date Wed, 15 Aug 2012 15:21:06 GMT
As a side comment, in previous benchmarking I've done on other systems, I've
found that using memory mapped IO (part of Java NIO) can make a lot of difference.

Also, when we put in gzip we expected it to speed things up, but it actually
slowed things down quite a bit.
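
For illustration, here is a minimal sketch of reading a file through a read-only memory mapping with Java NIO. The class and method names (MappedRead, readAll) are hypothetical and not part of UIMA, and the sketch assumes each file is under 2 GB, since a single mapping is limited to Integer.MAX_VALUE bytes. Whether it helps in a given setup would need to be measured.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedRead {
    // Reads a whole file through a read-only memory mapping.
    // Assumption: the file is smaller than 2 GB.
    static byte[] readAll(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] out = new byte[buf.remaining()];
            buf.get(out); // copy the mapped bytes into a heap array
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file and read it back through the mapping.
        Path tmp = Files.createTempFile("cas-", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        System.out.println(readAll(tmp).length + " bytes read");
    }
}
```

A deserializer could also consume the mapped buffer directly instead of copying it into an array; the copy is here only to keep the sketch short.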


On 8/15/2012 4:09 AM, Richard Eckart de Castilho wrote:
> Hi,
> I am looking for a way to improve loading times in an application, so I did a little
> experiment with binary CAS serialization to see if it was superior to XMI serialization. For
> serialization I used the CASCompleteSerializer to serialize the type-system and heaps into
> the same file using Java object serialization - at least that is what I understood it should
> do. To read in these files, I would deserialize the CASCompleteSerializer and initialize a
> CAS from it using CASImpl.reinit().
> 96.400 files
> plain text (uncompressed)      :                 581.865.593 Byte
> binary (serialized java, gzip) : 0:47:02.835   3.555.449.597 Byte 
> xmi (gzip)                     : 1:20:31.535   4.712.633.769 Byte
> So binary takes about 60% of the time that XMI serialization needs and uses about 75%
> of the space.
> I haven't run a reading experiment yet, but I suppose the improvement should be on a similar
> level, if not better.
> I am also not sure yet about the drawbacks of binary serialization and in which scenarios
> they apply. The drawbacks I have seen so far are:
> - The type system is stored redundantly in every output file.
> - The type system configured with CASImpl.reinit() may be different from the one which
> was used to initialize the pipeline, so CAS-based annotators relying on typeSystemInit() may
> not be configured with the correct types. This is a hypothesis I didn't test.
> - Serialized Java objects may become unreadable due to refactoring within the UIMA framework.
> However, there is yet another binary CAS serialization in UIMA which uses the DataOutputStream
> and may be more stable.
> Did anybody ever use any form of binary CAS serialization outside Vinci/UIMA-AS?
> Cheers,
> -- Richard
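
The gzip-plus-Java-serialization round trip described in the quoted message can be sketched roughly as follows. This is not Richard's code: the class GzipObjectIo and its methods are hypothetical, and a plain String stands in for the CASCompleteSerializer so that the example runs without UIMA on the classpath. In the real setup the object written would be the CASCompleteSerializer, and the object read back would be passed to CASImpl.reinit().

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipObjectIo {
    // Writes one serializable object to a gzip-compressed file.
    static void write(Serializable obj, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(file))))) {
            out.writeObject(obj);
        } // closing the outer stream also finishes the gzip stream
    }

    // Reads back the single object stored by write().
    static Object read(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(file))))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("cas-", ".ser.gz");
        tmp.deleteOnExit();
        write("stand-in for a CASCompleteSerializer", tmp); // hypothetical payload
        System.out.println(read(tmp));
    }
}
```

Dropping the GZIPOutputStream and GZIPInputStream layers gives the uncompressed variant, which makes it easy to time both and see whether compression helps or hurts on a given corpus.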
