uima-dev mailing list archives

From Erik Fäßler <c...@gmx.net>
Subject Flexibility of binary CAS serialization
Date Wed, 12 Dec 2012 16:27:17 GMT
Hi,

I am currently looking for a good approach to storing a lot of CAS data. I want to annotate
a large amount of text with basic annotations and save the result. Then I can read the CAS objects
with these basic annotations back in and don't have to recompute them over and over, because they
essentially never change. However, "basic" does not necessarily mean that the computation is fast;
that is why I want the storage.

Now I am considering binary storage because it's fast and the resulting files are not very big compared
to XMI serialization. But I have the requirement that I want to be able to extend the type
system (add features and types) without rendering the stored CAS objects useless.

I experimented with CASCompleteSerializer, which of course does not offer this flexibility
(but I still wanted to see how it works). Now I was hoping that if I used CASSerializer, I
would perhaps get the flexibility I want.

I serialize with

ByteArrayOutputStream baos = new ByteArrayOutputStream();
Serialization.serializeCAS(aJCas.getCas(), baos);

and I deserialize with

byte[] casData = ...
Serialization.deserializeCAS(aCAS, new ByteArrayInputStream(casData));

What DID work: when I add a feature to a serialized type, I can use the feature after deserialization
(that was not possible with CASCompleteSerializer). But when I add a new type which was not
part of the serialized type system, something odd happens: the AnalysisEngines seem to work fine. I
can read annotations which had been serialized before, and I can add new ones and read them
again, too.
However, when I want to store the final result as XMI (I did this for use with the annotationViewer),
I get an error during XMI serialization. The XMI serialization is done by

FileOutputStream out = new FileOutputStream(outFile);
XmiCasSerializer.serialize(aCas, out);
out.close();

which has always worked fine. The error is

Caused by: java.lang.IndexOutOfBoundsException: Index: 59, Size: 52
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.uima.cas.impl.StringHeap.getStringForCode(StringHeap.java:150)
	at org.apache.uima.cas.impl.CASImpl.getStringForCode(CASImpl.java:2139)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFeatures(XmiCasSerializer.java:892)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:753)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
	at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
	at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1567)
	at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1638)
	at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1585)
	at de.julielab.jules.consumer.CasToXmiConsumer.writeXmi(CasToXmiConsumer.java:338)
	at de.julielab.jules.consumer.CasToXmiConsumer.processCas(CasToXmiConsumer.java:288)
	at org.apache.uima.analysis_engine.impl.compatibility.CasConsumerAdapter.process(CasConsumerAdapter.java:99)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:375)
	... 4 more
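
For illustration only: the top frames of the trace show a plain list lookup (ArrayList.get, called
from StringHeap.getStringForCode) with an index of 59 on a list of size 52. The sketch below
reproduces just that failure shape with a toy class. ToyStringHeap and its methods are hypothetical
names and are not UIMA's implementation; why the CAS ends up holding such a string code is exactly
the open question.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT UIMA's StringHeap, just the same failure shape as in the trace.
class ToyStringHeap {
    private final List<String> strings = new ArrayList<>();

    // Stores a string and returns the integer code that refers to it.
    int add(String s) {
        strings.add(s);
        return strings.size() - 1;
    }

    // Resolves a code back to its string; throws if the code was never issued by this heap.
    String getStringForCode(int code) {
        return strings.get(code);
    }

    int size() {
        return strings.size();
    }

    public static void main(String[] args) {
        ToyStringHeap heap = new ToyStringHeap();
        for (int i = 0; i < 52; i++) {
            heap.add("s" + i); // heap now holds 52 strings, codes 0..51
        }
        int danglingCode = 59; // assumed value, taken from the exception message
        try {
            heap.getStringForCode(danglingCode);
        } catch (IndexOutOfBoundsException e) {
            // The exact message text depends on the Java version.
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

In this toy model every code returned by add() resolves correctly, so the failure only occurs for
a code that comes from somewhere else, for example from data written against a different heap.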

Is this behaviour expected, or did I just miss something? I don't really need the XMI serialization
in my use case, but I can't be confident in the whole storage procedure when such an error
happens.

Thanks for any hints,

Erik