uima-dev mailing list archives

From "Daniel Gruhl (JIRA)" <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
Date Thu, 17 Nov 2016 16:11:58 GMT

    [ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674095#comment-15674095 ]

Daniel Gruhl commented on UIMA-5106:

In systems with persistent analytics (that is, where CASes are stored long term and incrementally
annotated, often by humans) it is very helpful to have a stable UUID for a feature structure.
For example, there may be a document in a CAS that is under analysis. Being able to refer
to a span of that sofa and send it to a human for review or adjudication is very helpful.
It also allows CASes to hold "entity information", that is, frames of knowledge,
or to represent higher-level concepts (e.g., a web site CAS can be pointed to by all its
page CASes).

This was critical in large persistent UIMA systems such as WebFountain, and it would be nice
to see it make its way into the standard.

> uv3 constant "id" for FSs (Proposed new Feature for uv3)
> --------------------------------------------------------
>                 Key: UIMA-5106
>                 URL: https://issues.apache.org/jira/browse/UIMA-5106
>             Project: UIMA
>          Issue Type: New Feature
>          Components: Core Java Framework
>            Reporter: Marshall Schor
>            Priority: Minor
> Add constant ID for FSs. This would be an incrementing, long value. It would be constant
> through serialization/deserialization cycles. There would be a lazily created map from longs
> to FSs (via weak links) to allow direct access from the ID to the FS. Lazy intent is to not
> have a cost for this (space/time) other than the cost for 1 long / FS, if it is not used.
> We could make this feature optional, as well, to avoid the 8 bytes per FS overhead, but
> in V3, I think that's not a good tradeoff (space savings vs complexity).
> Issues: 
> * Current design allows parallelism of services, with returned results "stacked" into
>   receiving CAS; would need to change (some of) the IDs coming back.
> CAS would need to have the high-water-mark value as part of serializations.
> Backwards compatibility:
> * loading V2 CASs: generate new IDs upon loading.
> * serializing to V2: (for connecting to V2 services): drop the IDs.
> This is a proposed new V3 feature; comments appreciated.
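
For readers who want to picture the mechanism described in the issue above, here is a minimal
sketch of one way it might look: an incrementing long ID per FS, a high-water mark that could be
carried through serialization, and an ID-to-FS map with weak links that is only built on first
lookup. The names SketchFs and SketchCas are hypothetical and are not part of the UIMA API; the
strong-reference list standing in for the CAS indexes is also an assumption of this sketch, not
something the proposal specifies.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a feature structure; not the UIMA FeatureStructure API. */
final class SketchFs {
    final long id;       // constant ID: the "1 long / FS" cost from the proposal
    final String label;  // placeholder payload

    SketchFs(long id, String label) {
        this.id = id;
        this.label = label;
    }
}

/** Hypothetical stand-in for a CAS that assigns IDs; not the UIMA CAS API. */
final class SketchCas {
    private long highWaterMark;                               // last ID handed out
    private final List<SketchFs> indexed = new ArrayList<>(); // strong refs, as an index would hold
    private Map<Long, WeakReference<SketchFs>> byId;          // null until the first lookup

    /** Starts at 0 for a fresh CAS, or at a mark restored by deserialization. */
    SketchCas(long restoredHighWaterMark) {
        this.highWaterMark = restoredHighWaterMark;
    }

    SketchFs create(String label) {
        SketchFs fs = new SketchFs(++highWaterMark, label);
        indexed.add(fs);
        if (byId != null) {  // keep the map current only once it exists
            byId.put(fs.id, new WeakReference<>(fs));
        }
        return fs;
    }

    /** Returns the FS with this ID, or null if unknown or already collected. */
    SketchFs lookup(long id) {
        if (byId == null) {  // lazy creation: no map cost if lookup is never used
            byId = new HashMap<>();
            for (SketchFs fs : indexed) {
                byId.put(fs.id, new WeakReference<>(fs));
            }
        }
        WeakReference<SketchFs> ref = byId.get(id);
        return ref == null ? null : ref.get();
    }

    boolean idMapBuilt() { return byId != null; }

    long highWaterMark() { return highWaterMark; }
}
```

Because the map holds only weak links, it would not by itself keep an FS alive once nothing else
references it. Passing the saved high-water mark to a new SketchCas is meant to illustrate why the
issue says the mark would need to be part of the serialized form: new IDs then continue above the
old ones instead of colliding with them.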

This message was sent by Atlassian JIRA
