uima-dev mailing list archives

From Neal R Lewis <nrle...@us.ibm.com>
Subject Re: Maintaining UIMAj indexing and references while using stable FSIDs in a CAS Store
Date Mon, 04 Mar 2013 18:28:30 GMT

Hi Richard.

	Thanks for commenting on my post.  I have some responses below.


> I posted this a couple weeks ago, and didn't get any traction, so I thought
> I would try one more time for responses :)
>
> It was brought up recently in a meeting that we have to consider the effect
> of a Feature Structure ID in a CAS / CAS Store on deserialization of a CAS
> into UIMAj and the annotation indexing.
>
> e.g., how would adding a stable identifier affect indexing and references
> within jCAS objects?
>
> I'd like to throw out a couple scenarios to the community and see if these
> cover all of the possible use cases, and discuss how I currently implement
> it, and hopefully get some comments :)
>
> First, I'd like to confirm that I'm thinking of a CAS Store operating in
> between different PEARs or full UIMA Applications, not running between an
> Aggregate analytic (although that is definitely something to consider).

I don't understand what you are differentiating here. A PEAR can be a
component in a larger pipeline. I suppose an application would rather stand
alone and not interact with other applications in any way. I would probably
embed some analysis pipelines, PEARs, aggregates, whatever.

-- I see your point.  Currently, our applications all run as separate
JVMs, which we limit to 1 PEAR each.  Their interaction is done through
the CAS Store in that they retrieve CASes from previous analytics and
serialize them.

> Furthermore, I am assuming that the CAS Store interface retrieves a CAS
> object that agrees to the OASIS spec, and that the CAS store is responsible
> for creating FSIDs.

I suppose you imply here that the FSIDs are not available once the CAS has
been loaded into memory because the OASIS spec does not include FSIDs?

-- Yes, the FSID is currently not available in UIMAj after serialization
unless we explicitly add it to the type system.  We don't do that because
we want all our PEARs to run "out of the box", so to speak.  We do build
other simple analytics that run on CASes per the spec, and these can
access FSIDs.

It may also be problematic if the FSIDs are only available after saving the
CAS to the store and not immediately after adding the FS to the CAS.

We currently use the Store to add FSIDs so that we can maintain stable and
unique IDs.  One feature of a CAS Store is the ability to retrieve
projections of a CAS, that is, only the information needed by an analytic.
This means that a projection is only a fragment of the whole CAS, and is
incomplete.  When the CAS is put back into the Store, the interface
checks the next available FSID and uses that as a starting point for
incrementing IDs on new FSes as they come in.

One way around this could be to send the largest ID from the Store when it
is called, allowing the application to increment on new FSes as they come
in.  I'm sure there are other ways as well.
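
The second option (sending the largest ID to the application) can be
sketched roughly as follows. This is only an illustration under stated
assumptions: FsidAllocator is a hypothetical helper, not part of UIMAj or
of any existing CAS Store, and it assumes FSIDs are plain increasing
integers.

```java
/**
 * Hypothetical helper (not a UIMA class): hands out FSIDs for new feature
 * structures, starting just above the largest ID reported by the store.
 */
final class FsidAllocator {
    private long next;

    /** @param largestIdInStore highest FSID the store has issued so far */
    FsidAllocator(long largestIdInStore) {
        this.next = largestIdInStore + 1;
    }

    /** Returns an ID that has not been issued before by this allocator. */
    synchronized long allocate() {
        return next++;
    }

    /** Largest ID issued so far; the store could record this on write-back. */
    synchronized long largestIssued() {
        return next - 1;
    }
}
```

An analytic would create one allocator per retrieved CAS and call
allocate() whenever it adds an FS. Note that this alone does not keep IDs
unique if two applications work on projections of the same CAS at the same
time; that would need coordination in the Store.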



> I can think of four scenarios when deserializing a CAS xmi (I'm not sure
> about deserializing from binary) to a jCAS object, as it comes from the
> CAS Store.
>
> 1:  A minimal CAS that contains only a sofa and view. This is the simplest
> input to pull from a CAS Store, and doesn't require any modifications in
> the UIMAj deserialization.

If a CAS contains more than one Sofa/View, then I suppose a modification is
necessary because the XmiCasDeserializer restores all sofas/views from XMI,
not only select ones. Furthermore, annotations can be indexed in one view
but refer to annotations in another view. It could be problematic if this
other view is not available.

-- Our current implementation allows for projections to be retrieved
from the Store, and the projection contains all necessary elements.  My
example here is referring to a brand new CAS with a single Sofa/View, such
as one that is created from a single text source when it is loaded into an
analytic.

> 2:  A full CAS with a SOFA and associated annotations in multiple views

That's probably the one where no modifications are necessary.

> 3:  A CAS Fragment (or projection) of a single CAS xmi from the store, that
> contains only the information necessary for this particular Analytic
> Pipeline (there might or might not be a SOFA and view associated with it).

I think it would be problematic to access feature structures unless they
are indexed in a view. Note that I expect the result of a retrieval
operation is always a UIMA CAS and not some other data structure or simply
a list of FSes.

--Agreed, a CAS will also be returned, but it can be relieved of
unnecessary FSes if one so wished.

> 4:  A CAS created from one or more analytics on different artifacts (zero
> or more cas:Sofa elements, and zero or more View elements)

I didn't understand that point. Do you mean you synthesize a CAS from
multiple CASes? It sounds like combining 1, 2 or 3 with a CAS merger.

-- Perhaps this is not a good scenario then.  I was thinking of retrieving
multiple CASes and performing a CAS merger.  With stable IDs, this
shouldn't be an issue.


> Currently, if I use the FSID element, I have to set the deserialization to
> LENIENT, or preprocess them out of the CAS before deserialization. This
> simply removes the unknown attributes.

This sounds like you just patch additional attributes into the XMI and then
discard them during deserialization. I think this is problematic. Imagine I
want to retrieve a set of FSes identified as A, B and C. I get back a CAS
containing A, B and C, but I have no idea which one is which. In my opinion,
the FSID must be accessible through the CAS API.

-- This is a great point, and why I brought it up.  The FSIDs are used by
the store to retrieve the necessary FSes. All other information that an AE
can use to retrieve the proper feature structures remains in the CAS.  I
agree that the FSID should be accessible through the API.
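
To make the concern concrete: if the FSIDs are preprocessed out of the XMI,
the association between each FS and its FSID can at least be recorded
before it is dropped. The sketch below is a rough illustration using only
the JDK XML APIs. It assumes the store writes the ID as an un-namespaced
attribute called "fsid"; that name and the FsidExtractor class are made up
for this example. How the resulting map would then be exposed through the
CAS API is exactly the open question and is not addressed here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical preprocessing step, not part of UIMAj. */
final class FsidExtractor {
    // Assumption: attribute name used by the store for its stable ID.
    static final String FSID_ATTR = "fsid";
    // Namespace of the xmi:id attribute in XMI documents.
    static final String XMI_NS = "http://www.omg.org/XMI";

    /** Parses XMI text into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XMI", e);
        }
    }

    /**
     * Removes the fsid attribute from every element that has one and
     * returns the recorded mapping xmi:id -> fsid.
     */
    static Map<String, String> stripFsids(Document doc) {
        Map<String, String> xmiIdToFsid = new LinkedHashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute(FSID_ATTR)) {
                xmiIdToFsid.put(e.getAttributeNS(XMI_NS, "id"),
                                e.getAttribute(FSID_ATTR));
                e.removeAttribute(FSID_ATTR);
            }
        }
        return xmiIdToFsid;
    }
}
```

After stripFsids() the document no longer carries the unknown attribute, so
a strict deserialization would not trip over it, while the map still says
which xmi:id belonged to which FSID.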

> For scenario 1, other than lenient serialization, nothing needs to be
> completed.
>
> For scenario 2 and 3, the associated Type System of the CAS must be
> registered for serialization.

A CAS always requires a type system. Unless one assumes that the type
system is always provided by the context (e.g. application embedding the
analysis, the runtime environment or automatic discovery mechanisms as used
in uimaFIT), then I would expect it must be possible to ask the store for
the type system for any CAS stored in it. So when a CAS is written to the
store, then some type system must be associated with it.
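
As a rough illustration of that requirement, a store interface could simply
refuse to accept a CAS without a type system and hand the type system back
on request. The sketch below is an in-memory toy under stated assumptions:
InMemoryCasStore, its method names, and the choice to hold both the CAS and
the type system descriptor as XML strings are all hypothetical and not an
existing CAS Store API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical in-memory store; each CAS is kept with its type system. */
final class InMemoryCasStore {

    /** One stored CAS: XMI payload plus the type system it was written with. */
    private static final class Entry {
        final String xmi;
        final String typeSystemXml;

        Entry(String xmi, String typeSystemXml) {
            // Reject writes that do not carry a type system.
            this.xmi = Objects.requireNonNull(xmi, "xmi");
            this.typeSystemXml = Objects.requireNonNull(typeSystemXml, "typeSystemXml");
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Stores (or replaces) a CAS together with its type system descriptor. */
    void put(String casId, String xmi, String typeSystemXml) {
        entries.put(casId, new Entry(xmi, typeSystemXml));
    }

    /** Returns the type system descriptor associated with a stored CAS. */
    String typeSystemFor(String casId) {
        Entry e = entries.get(casId);
        if (e == null) {
            throw new IllegalArgumentException("unknown CAS: " + casId);
        }
        return e.typeSystemXml;
    }

    /** Returns the stored XMI for a CAS. */
    String xmiFor(String casId) {
        Entry e = entries.get(casId);
        if (e == null) {
            throw new IllegalArgumentException("unknown CAS: " + casId);
        }
        return e.xmi;
    }
}
```

A client that has no type system from its own context could then call
typeSystemFor() first and create its CAS from the returned descriptor
before deserializing the XMI.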


-- I'm not sure if I worded the scenario clearly, but I like the idea of
keeping the explicit Type System in the Store.

> For scenario 4, I haven't implemented it yet in UIMAj, but will be working
> on something for this soon.

There has been a post by Marshall
(http://markmail.org/thread/6q7demw2h3nzliyb)
pointing out several important issues that seem to apply in particular to
scenario 4. The post ends in a question how the scenario is envisioned, but
so far no answer has been given.


-- I guess this is something to contemplate more then :)


Thanks again Richard, you've given me a lot to think about!


Neal
