uima-dev mailing list archives

From "Marshall Schor (JIRA)" <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-3141) Binary CAS format 6 + type filtering fails to deserialize document annotation correctly
Date Sun, 04 Aug 2013 21:35:48 GMT

    [ https://issues.apache.org/jira/browse/UIMA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729006#comment-13729006 ]

Marshall Schor commented on UIMA-3141:

I took a look at this, and it may be working as designed.

Here's what appears to be happening (I haven't run the test case yet; I only examined the code):

1) A CAS, sourceCas, is created with a type system that includes a custom type definition,
DocMeta, which is a subtype of the built-in uima.tcas.DocumentAnnotation type.

1a) The code makes an instance of this type, and adds it to the indexes.

2) The sourceCas's "setDocumentLanguage" method is called. This method checks whether
there is an indexed instance of the document annotation type, and finds the instance of the
"DocMeta" type created in 1a); it then sets that instance's language feature to "latin".

3) The new form 6 serializer serializes out the sourceCas, using its type system, so all
"indexed" and reachable feature structures are serialized.

4) Now, the interesting part.  This file is deserialized into the targetCas.  However, that
CAS has been defined without the special type DocMeta.  With form 6, this type mismatch is
allowed; the semantics are that the deserialization process "filters" the feature
structures being deserialized, so that only those whose types are defined in the receiving
CAS are deserialized, and the others are "skipped".

So - this results in the DocMeta feature structure instance being skipped.

I think this is why the getDocumentLanguage call doesn't return the language that was set in
the DocMeta feature structure.

If you put the DocMeta type definition into the targetCas's type system description, does
it change the behavior so that getDocumentLanguage returns "latin"?
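
The filtering described in step 4 can be illustrated with a small toy model in plain Java.
This is only a sketch under stated assumptions: TypeFilterSketch, Fs, deserializeWithFiltering,
documentLanguage and languageAfterRoundTrip are hypothetical names that do not exist in UIMA,
and the model ignores everything about the real binary format. It only shows why skipping the
DocMeta instance would leave the target with the default language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only: none of these classes or methods are UIMA API.
final class TypeFilterSketch {

    static final String DOCUMENT_ANNOTATION = "uima.tcas.DocumentAnnotation";
    static final String DEFAULT_LANGUAGE = "x-unspecified";

    /** A feature structure reduced to its type name, its supertype, and a language value. */
    static final class Fs {
        final String type;
        final String superType;
        final String language;

        Fs(String type, String superType, String language) {
            this.type = type;
            this.superType = superType;
            this.language = language;
        }
    }

    /** Step 4: keep only the feature structures whose type the receiving side defines. */
    static List<Fs> deserializeWithFiltering(List<Fs> serialized, Set<String> targetTypes) {
        List<Fs> kept = new ArrayList<>();
        for (Fs fs : serialized) {
            if (targetTypes.contains(fs.type)) {
                kept.add(fs);
            } // otherwise the feature structure is "skipped"
        }
        return kept;
    }

    /** Language of the first document annotation (or direct subtype) found, else the default. */
    static String documentLanguage(List<Fs> indexed) {
        for (Fs fs : indexed) {
            if (DOCUMENT_ANNOTATION.equals(fs.type) || DOCUMENT_ANNOTATION.equals(fs.superType)) {
                return fs.language;
            }
        }
        return DEFAULT_LANGUAGE;
    }

    /** Steps 1 to 4 in one call: one indexed DocMeta with language "latin", then a filtered read. */
    static String languageAfterRoundTrip(Set<String> targetTypes) {
        List<Fs> source = List.of(new Fs("DocMeta", DOCUMENT_ANNOTATION, "latin"));
        return documentLanguage(deserializeWithFiltering(source, targetTypes));
    }

    public static void main(String[] args) {
        // Target type system without DocMeta: the instance is filtered out.
        System.out.println(languageAfterRoundTrip(Set.of(DOCUMENT_ANNOTATION)));
        // Target type system with DocMeta: the instance survives.
        System.out.println(languageAfterRoundTrip(Set.of(DOCUMENT_ANNOTATION, "DocMeta")));
    }
}
```

In this model, main prints x-unspecified for the first target and latin for the second.
Whether the real form 6 deserializer behaves the same way is exactly what the experiment
above would verify.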

> Binary CAS format 6 + type filtering fails to deserialize document annotation correctly
> ----------------------------------------------------------------------------------------
>                 Key: UIMA-3141
>                 URL: https://issues.apache.org/jira/browse/UIMA-3141
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.4.1SDK
>            Reporter: Richard Eckart de Castilho
>            Assignee: Marshall Schor
> When a custom document annotation type is used, the language is not properly restored
> after deserializing from CAS format 6.
> Expected: deserialized CAS has language "latin"
> Actual: deserialized CAS has language "x-unspecified"
> If the line {{sourceCas.addFsToIndexes(ma);}} is commented out, the code works.
> {code}
> import static org.junit.Assert.assertEquals;
> import static org.junit.Assert.assertTrue;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.InputStream;
> import java.io.OutputStream;
> import org.apache.commons.io.IOUtils;
> import org.apache.uima.cas.CAS;
> import org.apache.uima.cas.impl.Serialization;
> import org.apache.uima.cas.text.AnnotationFS;
> import org.apache.uima.resource.metadata.TypeSystemDescription;
> import org.apache.uima.resource.metadata.impl.TypeSystemDescription_impl;
> import org.apache.uima.util.CasCreationUtils;
> import org.junit.Rule;
> import org.junit.Test;
> import org.junit.rules.TemporaryFolder;
> public class MinimalTest
> {
>     @Rule
>     public TemporaryFolder testFolder = new TemporaryFolder();
>     @Test
>     public void test()
>         throws Exception
>     {
>         TypeSystemDescription sourceTsd = new TypeSystemDescription_impl();
>         sourceTsd.addType("DocMeta", "", CAS.TYPE_NAME_DOCUMENT_ANNOTATION);
>         TypeSystemDescription targetTsd = new TypeSystemDescription_impl();
>         CAS sourceCas = CasCreationUtils.createCas(sourceTsd, null, null);
>         AnnotationFS ma = sourceCas.createAnnotation(sourceCas.getTypeSystem().getType("DocMeta"),
>                 0, 0);
>         sourceCas.addFsToIndexes(ma);
>         sourceCas.setDocumentLanguage("latin");
>         sourceCas.setDocumentText("test");
>         File file = testFolder.newFile("test.bin");
>         OutputStream os = new FileOutputStream(file);
>         Serialization.serializeWithCompression(sourceCas, os, sourceCas.getTypeSystem());
>         IOUtils.closeQuietly(os);
>         assertTrue(new File(testFolder.getRoot(), "test.bin").exists());
>         CAS targetCas = CasCreationUtils.createCas(targetTsd, null, null);
>         InputStream is = new FileInputStream(file);
>         Serialization.deserializeCAS(targetCas, is, sourceCas.getTypeSystem(), null);
>         IOUtils.closeQuietly(is);
>         assertEquals("latin", targetCas.getDocumentLanguage());
>         assertEquals("test", targetCas.getDocumentText());
>     }
> }
> {code}

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
