uima-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Richard Eckart de Castilho (JIRA)" <...@uima.apache.org>
Subject [jira] [Updated] (UIMA-4049) The curious case of the zombie annotation
Date Sun, 12 Oct 2014 08:20:33 GMT

     [ https://issues.apache.org/jira/browse/UIMA-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Richard Eckart de Castilho updated UIMA-4049:
---------------------------------------------
    Attachment: CuriousTestCase.java

That also was the first solution that I though of. However, it is not as easy as that I am
afraid. If I do reindexing before removing, as you suggest, then I'm suddenly seeing "Friedrich
II." twice in the output. If I switch the order, remove first, and then reindex, I'm back
with the zombie "Franz" (see new attached test case).

Probably the code would need to be rewritten to remove the tokens from the index, then update
the offset, and then add them back again. But this makes a normally rather trivial operation
rather complex. Also, there is absolutely no warning or exception about the invalid operation
"indexed feature value changed while FS is in index". 

What's also nasty is, that the problem does not appear always, but only with certain CAS configurations
(probably related to the length of the text and/or related to the positions of the annotations).

As it is now, I think can be pretty hard to detect and unnerving for users, in particular
new users. Isn't there some way that can be handled better?

> The curious case of the zombie annotation
> -----------------------------------------
>
>                 Key: UIMA-4049
>                 URL: https://issues.apache.org/jira/browse/UIMA-4049
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>            Reporter: Richard Eckart de Castilho
>            Assignee: Marshall Schor
>         Attachments: CuriousTestCase.java
>
>
> When annotations are removed from indexes, sometimes they come back... the following
test case shows how an annotation is removed but still present when iterating over the index
later.
> {code}
>     @Test
>     public void testForZombies() throws Exception
>     {
>         // No zombie here
>         int[] offsets1 = { 0, 4, 5, 11, 12, 21, 22, 25, 26, 29, 30, 35, 36, 40, 41, 50,
51, 60, 61,
>                 64, 64, 65 };
>         testForZombies("Dies flößte Friedrich II. für seine neue Eroberung Besorgnis
ein.", offsets1);
>         
>         // Zombie hiding in here
>         int[] offsets2 = { 0, 3, 4, 7, 8, 13, 14, 18, 19, 22, 23, 33, 34, 35 };
>         testForZombies("Ich bin Franz III. von Hammerfels !", offsets2);
>     }
>     public void testForZombies(String aText, int[] aOffsets) throws Exception
>     {
>         // Init some dictionaries we ues
>         Set<String> names = new HashSet<String>();
>         names.add("Friedrich");
>         names.add("Franz");
>         Set<String> suffix = new HashSet<String>();
>         suffix.add("II.");
>         suffix.add("III.");
>         // Set up type system
>         TypeSystemDescription tsd = new TypeSystemDescription_impl();
>         tsd.addType("Token", "", CAS.TYPE_NAME_ANNOTATION);
>         
>         // Create CAS
>         CAS jcas = CasCreationUtils.createCas(tsd, null, null);
>         jcas.setDocumentText(aText);
>         
>         Type tokenType = jcas.getTypeSystem().getType("Token");
>         Feature beginFeature = tokenType.getFeatureByBaseName("begin");
>         
>         // Create tokens in CAS
>         for (int i = 0; i < aOffsets.length; i += 2) {
>             jcas.addFsToIndexes(jcas.createAnnotation(tokenType, aOffsets[i], aOffsets[i+1]));
>         }
>         
>         // List the tokens in the CAS
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             System.out.printf("Starting with %s%n", token.getCoveredText());
>         }
>         // Merge some tokens, in particular "Franz" "III." -> "Franz III." and "Friedrich"
"II."
>         // into "Friedrich II."
>         AnnotationFS previous = null;
>         List<AnnotationFS> toDelete = new ArrayList<>();
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             if (previous != null && names.contains(previous.getCoveredText())
>                     && suffix.contains(token.getCoveredText())) {
>                 token.setIntValue(beginFeature, previous.getBegin());
>                 toDelete.add(previous);
>             }
>             previous = token;
>         }
>         // Remove the no longer necessary tokens ("Friedrich" and "Franz"), since we
expanded the
>         // following tokens "III." and "II." to include their text
>         Set<String> removedWords = new HashSet<String>();
>         for (AnnotationFS token : toDelete) {
>             System.out.printf("Removing %s%n", token.getCoveredText());
>             removedWords.add(token.getCoveredText());
>             jcas.removeFsFromIndexes(token);
>         }
>         // Check if the tokens that we wanted to remove are really gone
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             System.out.printf("Remaining %s%n", token.getCoveredText());
>             if (removedWords.contains(token.getCoveredText())) {
>                org.junit.Assert.fail("I saw a zombie!!!");
>             }
>         }
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message