uima-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: CAS Id
Date Tue, 04 Oct 2011 09:34:05 GMT
On 10/3/11 3:37 PM, Eddie Epstein wrote:
> As Marshall pointed out above, a CAS can have many CAS Views, each
> with its own artifact. An analysis pipeline knows where these
> artifacts come from and can set metadata appropriately, but a unique
> ID for a stored copy of the CAS might best be determined by the
> persistent CAS storage system where the CAS is to be stored.

To summarize what has been said:
A unique ID per CAS seems to be useful for logging (and debugging) in
user code, because the IDs logged by the framework can be related to
the IDs logged by user code.
A CAS ID might not work in complex type systems which use multiple
views, because each sofa in a multi-view CAS might have a different
source ID.

Besides that, there are UIMA pipelines which always store a complete
CAS object in some kind of storage. There the CAS ID can just be the
unique storage ID. This could for example be a file system path, or an
HBase row key. As pointed out, this might not work for complex cases,
but could be helpful for simpler UIMA pipelines.

Our Solrcas AE could also just use the CAS ID by default, if the user
does not specify a Document ID Feature Structure. In my applications
this would actually work quite well.
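
A minimal sketch of that fallback, assuming the CAS ID is available as
a plain String. DocumentIdResolver and its resolve method are
hypothetical names for illustration; they are not part of Solrcas or
the UIMA API.

```java
import java.util.Optional;

// Hypothetical helper, not part of Solrcas or UIMA.
final class DocumentIdResolver {
    private DocumentIdResolver() {}

    /**
     * Returns the user-specified document ID when one is present and
     * non-empty; otherwise falls back to the CAS ID.
     */
    static String resolve(Optional<String> userDocumentId, String casId) {
        return userDocumentId
                .filter(id -> !id.isEmpty())
                .orElse(casId);
    }
}
```

With this, a pipeline that never configures a Document ID Feature
Structure would still get a stable ID for indexing.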

More complex applications could also decide to use the MIME type or
features in a view as additional information to complement the CAS ID
in a newly created view, in order to compute a storage ID.
For example, a UIMA pipeline might translate the input document text
to English and then store the new text in a new English view. The code
can then compute an ID which is based on the unique CAS ID.
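
One way such a derived ID could be computed is sketched here, assuming
the CAS ID and the view name are both plain Strings. ViewStorageIds and
the '#' separator are assumptions made for illustration, not an
existing UIMA convention.

```java
// Hypothetical helper, not part of UIMA.
final class ViewStorageIds {
    private ViewStorageIds() {}

    /**
     * Combines the CAS ID with a view name into a per-view storage ID.
     * The '#' separator is an arbitrary choice for this sketch.
     */
    static String forView(String casId, String viewName) {
        if (casId == null || casId.isEmpty()) {
            throw new IllegalArgumentException("casId must not be empty");
        }
        if (viewName == null || viewName.isEmpty()) {
            throw new IllegalArgumentException("viewName must not be empty");
        }
        return casId + "#" + viewName;
    }
}
```

For a CAS with the ID "doc-1" and a view named "english", this yields
"doc-1#english". If CAS IDs may themselves contain '#', a different
separator or an escaping scheme would be needed to avoid collisions.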

In the end I believe a simple CAS ID field could be quite useful: for
debugging/logging, as a document ID in simple UIMA pipelines, and for
applications which deal with whole CASes (e.g. the Cas Editor based
annotation tooling, or an AE which extracts "problematic" CASes from an
analysis pipeline for inspection).

To implement this I suggest that we extend the CAS interface with
CAS.setId(String) and CAS.getId() methods.
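
As a rough sketch of the shape of this proposal: the standalone
interface CasWithId and the class InMemoryCasId below are hypothetical
stand-ins so the example compiles without UIMA on the classpath. Only
the two method signatures come from the suggestion above; returning
null for an unset ID is an assumption.

```java
// Hypothetical stand-in for the two methods proposed for the CAS interface.
interface CasWithId {
    /** Sets the ID of this CAS, e.g. a storage key or file path. */
    void setId(String id);

    /** Returns the ID of this CAS, or null if none has been set. */
    String getId();
}

// Trivial implementation that just holds the ID in a field.
final class InMemoryCasId implements CasWithId {
    private String id;

    @Override
    public void setId(String id) {
        this.id = id;
    }

    @Override
    public String getId() {
        return id;
    }
}
```

A collection reader could call setId once when it creates the CAS, and
downstream components and log statements would then read it via getId.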

Jörn
