uima-dev mailing list archives

From Richard Eckart de Castilho <eck...@ukp.informatik.tu-darmstadt.de>
Subject Re: Unique IDs for Feature Structure instances - 3 observations
Date Wed, 06 Mar 2013 11:47:26 GMT
Some blue-sky thinking on this…

On 05.03.2013 at 18:17, Marshall Schor <msa@schor.com> wrote:

> Some of this has been previously stated.  I'm summarizing :-)
> ------------
> It seems these would be nice to have at runtime, not just externally.
> Assigning them at runtime has potential issues for "parallel" processing of
> CASes.  Parallelism can arise in UIMA-AS scheduling using the flow controller
> parallel-step option. 
> This can also arise in a simple application associated with a CAS Store, where
> the operation is to deserialize an existing CAS, add FSs to it, and reserialize
> the result back to the store *under the same CAS id*.
> The parallel use case here is that many of these operations could occur
> simultaneously. Of course, the reserializing would need to take account of the
> "high-water-mark" - just as is done for the flow-controller parallel-step
> option.  In that case, we also declare it is "illegal" for annotators to update
> feature structures "below the high-water-mark", because if two annotators
> updated the same slot, then the later one would "win", and the previous update
> would be "lost".
> Running in parallel means it may be hard to assign at FS creation time the
> "next" available unique FS id - so that's a problem to address.

Assuming that a CAS cannot actually be shared between AEs running in parallel,
it should at least be possible to assign a CAS-local ID.

It might not be necessary that the ID assigned at runtime is the actual FSID. 
If e.g. the CAS address of an FS can be resolved to an FSID at some point, or
vice versa, that may be sufficient.

For example, when I query a CAS store for the FSes A, B and C, and get back
a CAS containing them (and possibly FSes they reference transitively), a
mechanism that resolves FSID->CAS-address and CAS-address->FSID would be enough
to figure out which of the FSes are the ones I was looking for. As such, the
FSID wouldn't even have to be physically stored within the CAS.
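
A minimal sketch of what such a two-way resolver could look like, kept
entirely outside the CAS. Everything here is hypothetical: the class name
FsIdResolver, the use of a String for the FSID, and the modelling of a CAS
address as a plain int are assumptions for illustration, not existing UIMA API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.OptionalInt;

// Hypothetical side table maintained outside the CAS (not UIMA API).
// Maps store-level FSIDs to CAS-local addresses and back.
final class FsIdResolver {
    private final Map<String, Integer> addressByFsId = new HashMap<>();
    private final Map<Integer, String> fsIdByAddress = new HashMap<>();

    // Record that the FS at casAddress carries the given FSID.
    // Any earlier pairing of either side is dropped so both maps stay in sync.
    void bind(String fsId, int casAddress) {
        Integer oldAddress = addressByFsId.put(fsId, casAddress);
        if (oldAddress != null) {
            fsIdByAddress.remove(oldAddress);
        }
        String oldId = fsIdByAddress.put(casAddress, fsId);
        if (oldId != null && !oldId.equals(fsId)) {
            addressByFsId.remove(oldId);
        }
    }

    // FSID -> CAS address; empty if the FSID is not present in this CAS.
    OptionalInt toAddress(String fsId) {
        Integer address = addressByFsId.get(fsId);
        return address == null ? OptionalInt.empty() : OptionalInt.of(address);
    }

    // CAS address -> FSID; empty for FSes that were never given an FSID.
    Optional<String> toFsId(int casAddress) {
        return Optional.ofNullable(fsIdByAddress.get(casAddress));
    }
}
```

With a table like this, the FSes that were merely pulled in by reference
would simply have no entry, which is how they could be told apart from the
ones that were asked for.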

Maybe an external resource could be injected into AEs which provides this
resolving capability according to some rules defined by the CAS store.

By the same external resource, it may be possible for a CASMultiplier or FlowController
to announce that a CAS is being split and that the different splits should get a
special ID-prefix, so that during the merge, no conflicts occur. 
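
To illustrate the prefix idea, here is a rough sketch under my own
assumptions: SplitIdAllocator and SplitIdGenerator are made-up names, and the
"casId/sN-counter" layout of the IDs is just one possible convention. The
point is only that each split draws IDs from its own namespace, so the merge
cannot see the same ID twice.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical: lives in the shared external resource and hands out
// one distinct prefix per split of a given CAS.
final class SplitIdAllocator {
    private final String casId;
    private final AtomicInteger nextSplit = new AtomicInteger();

    SplitIdAllocator(String casId) {
        this.casId = casId;
    }

    // Called by whoever announces the split (e.g. a flow controller).
    SplitIdGenerator newSplit() {
        return new SplitIdGenerator(casId + "/s" + nextSplit.getAndIncrement());
    }
}

// Hypothetical: used inside one split to create IDs for new FSes.
final class SplitIdGenerator {
    private final String prefix;
    private final AtomicLong counter = new AtomicLong();

    SplitIdGenerator(String prefix) {
        this.prefix = prefix;
    }

    // IDs from different splits differ in the prefix, so they never collide.
    String nextId() {
        return prefix + "-" + counter.getAndIncrement();
    }
}
```

Only the allocation of the prefix needs coordination; once a split has its
generator, it can create IDs without talking to anyone else.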

If the different AAE instances in a distributed environment are completely
agnostic of each other, each instance could use a freshly-generated UUID for

Some out-of-band support for transmitting FSIDs in a distributed scenario may
be necessary, that is, a channel independent of the CAS.

The concurrent update and "high-water-mark" situation remains and seems
independent of FSIDs.

> --------------
> Another (potential) problem: if the FS id is added, this represents potentially
> a significant increase in the CAS size.  For some applications, this could be an
> issue.  So I hope the architecture allows modes of operation where there is no
> space taken in the CAS for this.  Something like this may be needed also for
> backwards compatibility.

If it were possible to externalize the FSID mechanism as suggested above, the CAS
itself wouldn't grow in memory.

When taking serialization mechanisms like XMI or binary CAS into account, an
FSID->CAS-address or FSID->XMI id mapping could be explicitly maintained in a
separate file (e.g. for the binary CAS) or partially implicitly in the case
of XMI, where the XMI-ID could be the CAS-local stable ID and only the CAS
ID would need to be stored in addition, assuming that any prefixes get merged
into the XMI ID.
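
For the XMI case, the full FSID could then be little more than the pair of
CAS ID and XMI ID. A small sketch, assuming a '#' separator and treating the
XMI ID as an opaque string so that a merged-in prefix fits; StableFsId is a
made-up name and none of this is existing UIMA API.

```java
// Hypothetical value type: a store-wide FSID built from the CAS ID
// plus the CAS-local XMI ID (which may already contain a split prefix).
final class StableFsId {
    private static final char SEPARATOR = '#'; // assumed convention

    final String casId;
    final String xmiId;

    StableFsId(String casId, String xmiId) {
        if (xmiId.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("XMI ID must not contain '#'");
        }
        this.casId = casId;
        this.xmiId = xmiId;
    }

    // External form, e.g. for storing alongside the XMI file.
    String format() {
        return casId + SEPARATOR + xmiId;
    }

    // Split at the last separator so the CAS ID itself may contain '#'.
    static StableFsId parse(String text) {
        int i = text.lastIndexOf(SEPARATOR);
        if (i < 0) {
            throw new IllegalArgumentException("missing '#': " + text);
        }
        return new StableFsId(text.substring(0, i), text.substring(i + 1));
    }
}
```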

> --------------
> It may be that many FSs in the CAS won't need a unique FSid.  An example: UIMA
> supports lists made out of Lisp-like "cons" cells - the FSList structure has 2
> slots - one is a reference (or nil) to the next cons object, the other is a
> reference to the item in the list at that spot.  I've seen applications that
> have 1000's or more of these cons cells.  They are never individually "indexed"
> (except perhaps occasionally the "head" of the list), but just serve to create
> the list.
> I wonder if an architecture for unique FSids could account for this, and not
> have any overhead for some FeatureStructures which won't need a unique FSid.

It may be reasonable to require that a pipeline explicitly request that FSIDs
be generated/maintained for specific types. This might be done via the external
resource mentioned above on a per-component basis, or globally when configuring
the external resource (or the underlying CAS store) in the first place.
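
A sketch of how such an opt-in could behave, again with invented names
(FsIdPolicy, SelectiveIdAssigner) and with types identified by their plain
name. For brevity it assumes an exact name match and ignores type
inheritance, which a real solution would probably have to consider.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical configuration: the set of type names that should get FSIDs.
final class FsIdPolicy {
    private final Set<String> trackedTypes;

    FsIdPolicy(Set<String> trackedTypes) {
        this.trackedTypes = new HashSet<>(trackedTypes);
    }

    // Exact-name match only; subtypes are not considered in this sketch.
    boolean tracks(String typeName) {
        return trackedTypes.contains(typeName);
    }
}

// Hypothetical: consulted whenever an FS is created.
final class SelectiveIdAssigner {
    private final FsIdPolicy policy;
    private long next = 0;

    SelectiveIdAssigner(FsIdPolicy policy) {
        this.policy = policy;
    }

    // Returns an ID only for tracked types; untracked FSes (such as list
    // cells) cost nothing beyond the set lookup.
    Optional<String> assign(String typeName) {
        if (!policy.tracks(typeName)) {
            return Optional.empty();
        }
        return Optional.of("fs-" + next++);
    }
}
```

Configured with only uima.tcas.Annotation, for example, the assigner would
skip every uima.cas.NonEmptyFSList cell and not even advance its counter.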


-- Richard

Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD) 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
