uima-dev mailing list archives

From "Marshall Schor (JIRA)" <...@uima.apache.org>
Subject [jira] [Comment Edited] (UIMA-4049) The curious case of the zombie annotation
Date Mon, 13 Oct 2014 14:21:33 GMT

    [ https://issues.apache.org/jira/browse/UIMA-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168778#comment-14168778 ]

Marshall Schor edited comment on UIMA-4049 at 10/13/14 2:20 PM:
----------------------------------------------------------------

After taking another look, I see the following.  

The Annotation index is a sorted index using 3 "keys": the begin, the end (features), and the
type priorities (not used here).

The "remove" operation is defined to take the FS given, use its feature values as the
"keys", and do a find operation for that FS in the index.  See FsIntArrayIndex line 410.
The find operation first does a binary search, which only works correctly if the items are
sorted (which is unfortunately not the case here).  Following the find, a subsearch is done
to locate the FS whose "id" matches the one being removed (because there could be multiple
FSs with equal key values).
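
To make the find-then-match-by-id sequence concrete, here is a minimal sketch in plain Java.
It is not the UIMA implementation: the names KeyedIndexSketch and Entry are made up for
illustration, type priorities are ignored, and the key order (begin ascending, then end
descending) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class KeyedIndexSketch {
    /** Hypothetical stand-in for an indexed FS: an id plus the two key features. */
    static final class Entry {
        final int id;
        int begin;
        int end;
        Entry(int id, int begin, int end) { this.id = id; this.begin = begin; this.end = end; }
    }

    private final List<Entry> items = new ArrayList<>();

    /** Assumed key order for this sketch: begin ascending, then end descending. */
    static int compareKeys(Entry a, Entry b) {
        if (a.begin != b.begin) return Integer.compare(a.begin, b.begin);
        return Integer.compare(b.end, a.end);
    }

    /** Inserts after any entries with equal keys, keeping the list sorted. */
    void add(Entry e) {
        int pos = 0;
        while (pos < items.size() && compareKeys(items.get(pos), e) <= 0) pos++;
        items.add(pos, e);
    }

    /** Binary search on the keys, then a scan over equal-keyed neighbours for the id. */
    boolean remove(Entry e) {
        int lo = 0, hi = items.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareKeys(items.get(mid), e);
            if (c < 0) lo = mid + 1;
            else if (c > 0) hi = mid - 1;
            else return removeById(mid, e);
        }
        return false; // the search never reached an entry with these key values
    }

    private boolean removeById(int hit, Entry e) {
        int first = hit;
        while (first > 0 && compareKeys(items.get(first - 1), e) == 0) first--;
        for (int i = first; i < items.size() && compareKeys(items.get(i), e) == 0; i++) {
            if (items.get(i).id == e.id) { items.remove(i); return true; }
        }
        return false;
    }

    int size() { return items.size(); }

    boolean containsId(int id) {
        for (Entry x : items) if (x.id == id) return true;
        return false;
    }

    public static void main(String[] args) {
        KeyedIndexSketch index = new KeyedIndexSketch();
        Entry a = new Entry(1, 0, 4);
        Entry b = new Entry(2, 5, 11);
        Entry c = new Entry(3, 5, 11); // same key values as b, different id
        index.add(a); index.add(b); index.add(c);
        System.out.println(index.remove(c)); // true: the id scan picks c, not b
        System.out.println(index.size());    // 2
    }
}
```

Note that remove() trusts the list to be sorted by the current key values of its entries;
nothing in it can detect an entry whose keys were changed after it was added.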

What happens in this case is:
  1) token with id 21 (Friedrich) is added to the index (begin 12, end 21)
  2) token with id 25 (II.) is added to the index (begin 22, end 25)
  3) token with id 25 is modified; the begin is changed (while it is indexed) to 12.
  4) The remove operation attempts to find the item to be removed.  Because it is looking
in a sorted index, it does a binary search for the token whose begin is 12 and end is 21 (the
token with id 21).
       - The 2nd probe of the binary search hits token 21 (Friedrich).  It should find that
the token being searched for is > token 21.  However, the token being searched for
(token id 25) was modified; its begin is now == to that of token 21, and its end feature
is > that of token 21.  So the compare incorrectly concludes that the token being
searched for is earlier in the list.

       - If token 25 had not had its begin value updated, the compare would have found that
the test token was later in the list (because its begin value was higher).

So, bottom line, the find operation fails to find the item to be removed, and the remove fails.

The reindex results in adding the modified token with id 25 into the Annotation index again,
so it appears twice. 
 
The index looks like this: (I'm typing from the Eclipse debugger, looking at the CAS, with
"show logical structure" turned on)
{code} 
Contents of the Index:
 [0] DocumentAnnotation
 [1] Token Dies
 [2] Token flosse
 [3] Token Friedrich II.     // the reindexed item
 [4] Token Friedrich         // the failed-to-be-removed item
 [5] Token Friedrich II.     // the original indexed II. token with its begin modified.
 ...
{code}
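
The effect of changing a key while the FS is still indexed can be reproduced outside UIMA.
The following is a simplified model, not the actual index code: StaleKeyDemo and Tok are
made-up names, the key order (begin ascending, then end descending) is an assumption, and
the offsets are those of the "Ich bin Franz III. von Hammerfels !" sentence from the quoted
test case. Whether a given search fails depends on which entries the probes happen to hit.

```java
import java.util.ArrayList;
import java.util.List;

class StaleKeyDemo {
    /** Hypothetical stand-in for a token annotation; only the key features are kept. */
    static final class Tok {
        int begin;
        final int end;
        Tok(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Assumed key order for this sketch: begin ascending, then end descending. */
    static int compareKeys(Tok a, Tok b) {
        if (a.begin != b.begin) return Integer.compare(a.begin, b.begin);
        return Integer.compare(b.end, a.end);
    }

    /** Plain binary search; returns the position of the target object, or -1. */
    static int find(List<Tok> index, Tok target) {
        int lo = 0, hi = index.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareKeys(index.get(mid), target);
            if (c < 0) lo = mid + 1;
            else if (c > 0) hi = mid - 1;
            else return index.get(mid) == target ? mid : -1; // keys are unique in this demo
        }
        return -1;
    }

    /** Tokens of "Ich bin Franz III. von Hammerfels !", already in sorted order. */
    static List<Tok> franzTokens() {
        int[] offsets = { 0, 3, 4, 7, 8, 13, 14, 18, 19, 22, 23, 33, 34, 35 };
        List<Tok> index = new ArrayList<>();
        for (int i = 0; i < offsets.length; i += 2) {
            index.add(new Tok(offsets[i], offsets[i + 1]));
        }
        return index;
    }

    public static void main(String[] args) {
        List<Tok> index = franzTokens();
        Tok franz = index.get(2); // begin 8, end 13
        Tok iii = index.get(3);   // begin 14, end 18
        System.out.println(find(index, franz)); // 2: found while the list is sorted

        iii.begin = franz.begin;  // key changed in place; the list is no longer sorted
        System.out.println(find(index, franz)); // -1: the first probe lands on the
                                                // modified token and steers the search right
    }
}
```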

So - the correct way to modify a currently indexed FS, when the change affects the values of
any keys, is:

1) first remove it from the indexes (before you modify it)
2) do the modifications
3) then add it back to the indexes (assuming you want it to be indexed again).
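
As a rough illustration of these three steps, here is a sketch that uses a java.util.TreeSet
as a stand-in for a sorted index. ReindexSketch, Tok and updateIndexed are hypothetical names,
and the key order is an assumption; in the CAS API the corresponding calls would be
removeFsFromIndexes and addFsToIndexes, as used in the quoted test case.

```java
import java.util.Comparator;
import java.util.TreeSet;
import java.util.function.Consumer;

class ReindexSketch {
    /** Hypothetical stand-in for a token annotation; only the key features are kept. */
    static final class Tok {
        int begin;
        int end;
        Tok(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Assumed key order for this sketch: begin ascending, then end descending. */
    static final Comparator<Tok> KEYS = (a, b) -> a.begin != b.begin
            ? Integer.compare(a.begin, b.begin)
            : Integer.compare(b.end, a.end);

    /** Applies a key-changing modification using remove - modify - add back. */
    static <T> void updateIndexed(TreeSet<T> index, T item, Consumer<T> change) {
        boolean wasIndexed = index.remove(item); // 1) remove while the keys still match
        change.accept(item);                     // 2) do the modification
        if (wasIndexed) index.add(item);         // 3) add back under the new keys
    }

    public static void main(String[] args) {
        TreeSet<Tok> index = new TreeSet<>(KEYS);
        Tok franz = new Tok(8, 13);
        Tok iii = new Tok(14, 18);
        index.add(new Tok(4, 7));
        index.add(franz);
        index.add(iii);
        index.add(new Tok(19, 22));

        updateIndexed(index, iii, t -> t.begin = franz.begin);

        System.out.println(index.remove(franz)); // true: the index stayed consistent
        System.out.println(index.size());        // 3
    }
}
```

One limitation of the stand-in: a TreeSet treats two elements with equal keys as the same
element, so it cannot hold several FSs with identical begin and end the way the real index can.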

I modified the loop which changes the begin values so that it does not update the token at
that point, but just adds what needs to be done to a list.  Then, later, outside of the
Annotation iterator, I had the code go through that list and apply the modifications to the
token begin values using the remove - modify - add back to indexes approach.
This worked in either order.
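
A sketch of that two-pass structure, again in plain Java with a TreeSet standing in for the
Annotation index. This is not the code I changed in the test case; DeferredMergeSketch, Tok
and mergeDeferred are illustrative names, and the key order is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DeferredMergeSketch {
    /** Hypothetical stand-in for a token annotation over a document string. */
    static final class Tok {
        final String doc;
        int begin;
        final int end;
        Tok(String doc, int begin, int end) { this.doc = doc; this.begin = begin; this.end = end; }
        String covered() { return doc.substring(begin, end); }
    }

    static List<String> mergeDeferred(String doc, int[] offsets) {
        Set<String> names = new HashSet<>(Arrays.asList("Friedrich", "Franz"));
        Set<String> suffixes = new HashSet<>(Arrays.asList("II.", "III."));

        // Assumed key order for this sketch: begin ascending, then end descending.
        Comparator<Tok> keys = (a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.end, a.end);
        TreeSet<Tok> index = new TreeSet<>(keys);
        for (int i = 0; i < offsets.length; i += 2) {
            index.add(new Tok(doc, offsets[i], offsets[i + 1]));
        }

        // Pass 1: iterate over the index, but only record what needs to be done.
        List<Tok[]> pending = new ArrayList<>();
        Tok previous = null;
        for (Tok token : index) {
            if (previous != null && names.contains(previous.covered())
                    && suffixes.contains(token.covered())) {
                pending.add(new Tok[] { previous, token });
            }
            previous = token;
        }

        // Pass 2: outside the iterator, apply remove - modify - add back.
        for (Tok[] pair : pending) {
            Tok name = pair[0];
            Tok suffix = pair[1];
            index.remove(suffix);      // remove before touching a key
            suffix.begin = name.begin; // modify
            index.add(suffix);         // add back under the new keys
            index.remove(name);        // the name token is no longer needed
        }

        List<String> remaining = new ArrayList<>();
        for (Tok t : index) remaining.add(t.covered());
        return remaining;
    }

    public static void main(String[] args) {
        int[] offsets = { 0, 3, 4, 7, 8, 13, 14, 18, 19, 22, 23, 33, 34, 35 };
        // prints [Ich, bin, Franz III., von, Hammerfels, !]
        System.out.println(mergeDeferred("Ich bin Franz III. von Hammerfels !", offsets));
    }
}
```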

I see this method, its restrictions, etc., are documented here: http://uima.apache.org/downloads/releaseDocs/2.2.2-incubating/docs/html/references/references.html#ugr.ref.jcas.adding_removing_instances_to_indexes

I do agree with you that it would be good to have some kind of automated checking of this;
if someone can suggest a way that has minimal impact on correctly done code, it would be great
to hear about it.
   



> The curious case of the zombie annotation
> -----------------------------------------
>
>                 Key: UIMA-4049
>                 URL: https://issues.apache.org/jira/browse/UIMA-4049
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>            Reporter: Richard Eckart de Castilho
>            Assignee: Marshall Schor
>         Attachments: CuriousTestCase.java
>
>
> When annotations are removed from indexes, sometimes they come back... the following
> test case shows how an annotation is removed but still present when iterating over the index
> later.
> {code}
>     @Test
>     public void testForZombies() throws Exception
>     {
>         // No zombie here
>         int[] offsets1 = { 0, 4, 5, 11, 12, 21, 22, 25, 26, 29, 30, 35, 36, 40, 41, 50,
>                 51, 60, 61, 64, 64, 65 };
>         testForZombies("Dies flößte Friedrich II. für seine neue Eroberung Besorgnis ein.", offsets1);
>         
>         // Zombie hiding in here
>         int[] offsets2 = { 0, 3, 4, 7, 8, 13, 14, 18, 19, 22, 23, 33, 34, 35 };
>         testForZombies("Ich bin Franz III. von Hammerfels !", offsets2);
>     }
>     public void testForZombies(String aText, int[] aOffsets) throws Exception
>     {
>         // Init some dictionaries we use
>         Set<String> names = new HashSet<String>();
>         names.add("Friedrich");
>         names.add("Franz");
>         Set<String> suffix = new HashSet<String>();
>         suffix.add("II.");
>         suffix.add("III.");
>         // Set up type system
>         TypeSystemDescription tsd = new TypeSystemDescription_impl();
>         tsd.addType("Token", "", CAS.TYPE_NAME_ANNOTATION);
>         
>         // Create CAS
>         CAS jcas = CasCreationUtils.createCas(tsd, null, null);
>         jcas.setDocumentText(aText);
>         
>         Type tokenType = jcas.getTypeSystem().getType("Token");
>         Feature beginFeature = tokenType.getFeatureByBaseName("begin");
>         
>         // Create tokens in CAS
>         for (int i = 0; i < aOffsets.length; i += 2) {
>             jcas.addFsToIndexes(jcas.createAnnotation(tokenType, aOffsets[i], aOffsets[i+1]));
>         }
>         
>         // List the tokens in the CAS
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             System.out.printf("Starting with %s%n", token.getCoveredText());
>         }
>         // Merge some tokens, in particular "Franz" "III." -> "Franz III." and "Friedrich" "II."
>         // into "Friedrich II."
>         AnnotationFS previous = null;
>         List<AnnotationFS> toDelete = new ArrayList<>();
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             if (previous != null && names.contains(previous.getCoveredText())
>                     && suffix.contains(token.getCoveredText())) {
>                 token.setIntValue(beginFeature, previous.getBegin());
>                 toDelete.add(previous);
>             }
>             previous = token;
>         }
>         // Remove the no longer necessary tokens ("Friedrich" and "Franz"), since we expanded the
>         // following tokens "III." and "II." to include their text
>         Set<String> removedWords = new HashSet<String>();
>         for (AnnotationFS token : toDelete) {
>             System.out.printf("Removing %s%n", token.getCoveredText());
>             removedWords.add(token.getCoveredText());
>             jcas.removeFsFromIndexes(token);
>         }
>         // Check if the tokens that we wanted to remove are really gone
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             System.out.printf("Remaining %s%n", token.getCoveredText());
>             if (removedWords.contains(token.getCoveredText())) {
>                org.junit.Assert.fail("I saw a zombie!!!");
>             }
>         }
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
