uima-dev mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: changing edge case impl details in casCopiers
Date Mon, 04 Apr 2016 20:41:55 GMT
This continues to be an interesting set of use cases :-)

On 4/1/2016 5:28 PM, Richard Eckart de Castilho wrote:
> Hi,
> I would say that as long as the CasCopier doesn't simply fail if it thinks that a copy would
> be invalid/unsafe, and as long as one can fix potentially broken copies afterwards, it would
> be OK in general. OK, existing code might break...
breaking existing code - probably a bad thing, and to be avoided ...
> The use-case below was half hypothetical. Very real is a reverse use-case which we have
> implemented in DKPro Core.
> * view A contains a text
> * view B is created through a transformation of the text from A
> * annotations are created in view B
> * annotations are copied back to view A
> * offsets in the copied annotations are updated based on a reverse of the transformation
>   operation in the second step
> The code we currently use to handle the copying back looks like this:
> CasCopier copier = new CasCopier(inputCas, outputCas);
> for (FeatureStructure fs : selectFS(inputCas, getType(inputCas, typeName))) {
>   if (!copier.alreadyCopied(fs)) {
>     FeatureStructure fsCopy = copier.copyFs(fs);
>     // Make sure that the sofa annotation in the copy is set
>     if (fs instanceof AnnotationBaseFS) {
>       FeatureStructure sofa = fsCopy.getFeatureValue(mDestSofaFeature);
>       if (sofa == null) {
>         fsCopy.setFeatureValue(mDestSofaFeature, outputCas.getSofa());
>       }
>     }
>     aOutput.addFsToIndexes(fsCopy);
>   }
> }
The existing CasCopier code, when copying a FS which is a subtype of
AnnotationBase, copies the sofa ref by getting the "corresponding" sofa in the
target CAS.  It does this by getting the sofa whose sofa number is the same. 

In the use case:

* view A contains a text
* view B is created through a transformation of the text from A
* annotations are created in view B
* annotations are copied back to view A
* offsets in the copied annotations are updated based on a reverse of the transformation
  operation in the second step

it would seem that step 3 (annotations created in view B) would create them in
view "B".   So the sofa references of annotations (which are subtypes of
AnnotationBase) would refer to the sofa associated with view "B".
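
The last step of the use case (mapping offsets from the transformed text back to
the original) can be sketched in plain Java, without the UIMA API.  This is only
an illustration under stated assumptions: the tag-stripping transformation and
the names OffsetMapSketch, Transformed, stripTags, mapBegin and mapEnd are
hypothetical and are not taken from DKPro Core or UIMA.

```java
import java.util.Arrays;

public class OffsetMapSketch {

    /** Derived text plus, for each position in it, the offset of that character in the original. */
    public static final class Transformed {
        public final String text;
        public final int[] toOriginal; // one extra trailing entry maps "end of text"

        Transformed(String text, int[] toOriginal) {
            this.text = text;
            this.toOriginal = toOriginal;
        }

        /** Maps a begin offset in the derived text back to the original text. */
        public int mapBegin(int begin) {
            return toOriginal[begin];
        }

        /** Maps an exclusive end offset back, staying tight around the last covered character. */
        public int mapEnd(int begin, int end) {
            return end > begin ? toOriginal[end - 1] + 1 : toOriginal[begin];
        }
    }

    /** Toy transformation (an assumption for illustration): drop everything from '<' to the next '>'. */
    public static Transformed stripTags(String original) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[original.length() + 1];
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                map[out.length()] = i; // remember where this kept character came from
                out.append(c);
            }
        }
        map[out.length()] = original.length();
        return new Transformed(out.toString(), Arrays.copyOf(map, out.length() + 1));
    }

    public static void main(String[] args) {
        String viewA = "<b>Hi</b> you";          // text of view A
        Transformed viewB = stripTags(viewA);    // text of view B
        int begin = viewB.text.indexOf("you");   // an annotation made in view B
        int end = begin + "you".length();
        // Offsets mapped back into view A's text select the same characters.
        System.out.println(viewA.substring(viewB.mapBegin(begin), viewB.mapEnd(begin, end)));
    }
}
```

Mapping the end offset through the last covered character, rather than through
the character that follows it, keeps the mapped span from swallowing a removed
tag that sits right after the annotation.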

In the code example above, a cas copier is created to go from view "B" as the
source to view "A".  (I'm assuming the cas copier creation call is passing in
two CAS "views", the source being some CAS's view "B", and the target being some
CAS's view "A".)  It's ambiguous whether these are two separate CASes or
two views of the same CAS (can you clarify?).

The copyFs call sets the sofa ref in the copy to point to the sofa in the target
CAS which has the same sofa number as the sofa had in view "B" (the source CAS
view), unless the source had null for the sofa reference, in which case, the
target is left as null. 

It's possible that this might accidentally "work" for some view populations of
the two CASes.
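
To make the sofa-number matching described above concrete, here is a minimal
stand-alone sketch in plain Java.  It does not use the UIMA API: Sofa, ToyCas
and correspondingSofa are hypothetical names, and the numbering of sofas in view
creation order is an assumption of this toy model.  It only mirrors the rule as
stated in this message (take the target sofa with the same sofa number; a null
source reference stays null).

```java
import java.util.ArrayList;
import java.util.List;

public class SofaNumberSketch {

    /** Hypothetical stand-in for a sofa: its number within its CAS and the name of its view. */
    public static final class Sofa {
        public final int number;
        public final String viewName;

        Sofa(int number, String viewName) {
            this.number = number;
            this.viewName = viewName;
        }
    }

    /** Hypothetical stand-in for a CAS; the toy numbers sofas 1, 2, ... in view creation order. */
    public static final class ToyCas {
        private final List<Sofa> sofas = new ArrayList<>();

        public Sofa createView(String viewName) {
            Sofa sofa = new Sofa(sofas.size() + 1, viewName);
            sofas.add(sofa);
            return sofa;
        }

        public Sofa sofaByNumber(int number) {
            return number >= 1 && number <= sofas.size() ? sofas.get(number - 1) : null;
        }
    }

    /** Models the described rule: same sofa number in the target; null stays null. */
    public static Sofa correspondingSofa(Sofa sourceSofa, ToyCas target) {
        return sourceSofa == null ? null : target.sofaByNumber(sourceSofa.number);
    }

    public static void main(String[] args) {
        // One CAS holding views A and B; copying from view B to view A within it.
        ToyCas single = new ToyCas();
        single.createView("A");
        Sofa b = single.createView("B");
        System.out.println(correspondingSofa(b, single).viewName);

        // Two CASes whose only views happen to share the same sofa number.
        ToyCas source = new ToyCas();
        Sofa onlyB = source.createView("B");
        ToyCas target = new ToyCas();
        target.createView("A");
        System.out.println(correspondingSofa(onlyB, target).viewName);
    }
}
```

In this toy model the same-CAS copy keeps pointing at view B's sofa, while the
two-CAS copy lands on view A only because the sofa numbers happen to line up.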

> Source: https://github.com/dkpro/dkpro-core/blob/7c8785647ca8c5905aa108251935069e601cbb8d/dkpro-core-api-transform-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/api/transform/JCasTransformer_ImplBase.java#L99
> I guess this code would still work and wouldn't throw exceptions or such.
It might or might not, depending on whether or not there's a hard constraint
that the source and target CASes be the same CAS (with multiple views).  It
doesn't work, I think, in all cases when there are multiple CASes/ multiple views.
> If I understand the diagrams in the wiki correctly, there is one case where the sofa
> of the copied FS points to the source view but the FS is indexed in the target view. This
> seems to be the only difference between the case of copying between CASes and within a CAS. I
> think it may be better/simpler/more consistent to set the sofa of the copy to null in both
> cases, and if the user really wants the FS to point to a sofa in a different view, then he
> should set the sofa in this way manually after the copy is complete.

I'm hoping to have an approach which won't break backwards compatibility...
> Btw... at least when copying individual FSes, the copy isn't indexed anyway by the CasCopier.
> We are talking only about the bulk-copy method then?

You are correct, the copy isn't indexed when you use the copyFs API.  However,
its sofa reference is set, and if set "wrong", an attempt to add the fs to the
indexes will throw an error.  This check was added in version 2.7.0, and was
intended to prevent accidents.

In an earlier note, I said maybe we could add an API to allow updating the sofa
reference.  The DKPro code above found a way using existing APIs to do this; we
could just keep this.
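
The interaction between the index check and a null-only fix-up like the one in
the quoted code can be illustrated with a small plain-Java model.  This is a
sketch under assumptions, not UIMA code: ToyAnnotation, ToyView, addToIndexes
and fixSofaIfNull are hypothetical names, views are identified by name only, and
what real UIMA does with a null sofa reference is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexCheckSketch {

    /** Hypothetical annotation that only records which view's sofa it refers to (may be null). */
    public static final class ToyAnnotation {
        public String sofaView;

        public ToyAnnotation(String sofaView) {
            this.sofaView = sofaView;
        }
    }

    /** Hypothetical view with a toy version of the "wrong sofa" check on indexing. */
    public static final class ToyView {
        public final String name;
        public final List<ToyAnnotation> index = new ArrayList<>();

        public ToyView(String name) {
            this.name = name;
        }

        public void addToIndexes(ToyAnnotation a) {
            // Reject an annotation whose sofa reference points at a different view.
            if (a.sofaView != null && !a.sofaView.equals(name)) {
                throw new IllegalStateException(
                        "sofa refers to view " + a.sofaView + ", not " + name);
            }
            index.add(a);
        }
    }

    /** Mirrors the shape of the quoted fix-up: set the sofa only if the copy has none. */
    public static void fixSofaIfNull(ToyAnnotation copy, ToyView target) {
        if (copy.sofaView == null) {
            copy.sofaView = target.name;
        }
    }

    public static void main(String[] args) {
        ToyView viewA = new ToyView("A");

        ToyAnnotation copyWithoutSofa = new ToyAnnotation(null);
        fixSofaIfNull(copyWithoutSofa, viewA);
        viewA.addToIndexes(copyWithoutSofa); // accepted: the sofa now refers to view A

        ToyAnnotation copyPointingAtB = new ToyAnnotation("B");
        fixSofaIfNull(copyPointingAtB, viewA); // no effect: the sofa is not null
        try {
            viewA.addToIndexes(copyPointingAtB);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the model is that a fix-up guarded by a null test repairs only
copies that arrive without a sofa reference; a copy that already points at
another view would need its reference overwritten before indexing.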

> Cheers,
> -- Richard
>> On 01.04.2016, at 15:57, Marshall Schor <msa@schor.com> wrote:
>> Hi Richard,
>> Thanks for this use-case.  I think there may be 2 subcases.
>> 1) The views, A and B, are in the same CAS, and
>> 2) The views, A and B, are in different CASes
>> In case 1), with this new proposal the annotations copied from view A to B would
>> have their "sofa" reference continue to point to the text in view A.  This means:
>> a) The references into the text are still "valid", but of course point to the
>> text in view A.
>> b) To do the updating process to have them point to the de-xml'ed version of the
>> text, not only do the begin/end references need to be updated, but the sofa
>> reference needs to be changed.  We could add an API to update that to the
>> current view's.
>> In case 2), the annotations in B would no longer have a valid sofa reference at
>> all (it would be set to null).
>> This would clearly be a problem; but once again, we could add an API to update
>> that to the current view's.
>> --------------------------------
>> So, it looks like this proposed design change would break the use-case you
>> suggested. 
>> The current design would seem to support this use case, but only if the two
>> views are in different CASes.
>> If they were in the same CAS, I think the current implementation (not tested,
>> just reading the code) would have the copied Annotations have their sofa
>> references be to the sofa in CAS A.
>> Does this match what you're currently seeing?
>> -Marshall
>> On 3/31/2016 4:36 PM, Richard Eckart de Castilho wrote:
>>> On 31.03.2016, at 21:22, Marshall Schor <msa@schor.com> wrote:
>>>> I'm thinking of changing how cas copier works with respect to managing Sofas
>>>> and sofa ref updating.  I've written something up here:
>>>> https://cwiki.apache.org/confluence/display/UIMA/CasCopier+and+Views
>>>> Comments / feedback / what did I overlook?  appreciated :-) -Marshall
>>> Consider the following case:
>>> - there are two views, A and B
>>> - the text in B has been derived from A through some transformation, e.g. the
>>>   removal of XML tags
>>> - A contains UIMA annotations that represent the XML tags and that point into
>>>   the text in A
>>> - as part of a second transformation process, all annotations in A are to be
>>>   copied into B
>>> - after the copy has been performed, the offsets of the copied annotations are
>>> Would such a scenario still be supported after the changes you suggest?
>>> Best,
>>> -- Richard
