uima-dev mailing list archives

From Erik Fäßler <erik.faess...@uni-jena.de>
Subject Re: Requirements / Wish List for CAS Store?
Date Wed, 09 Jan 2013 13:58:11 GMT
Hi, and a Happy New Year to you, too!

I am currently using a Postgres DB just for storing CAS XMIs for later use. I went for XMI
because it is more flexible for further processing than the binary format (you already mentioned
the point of using the store as intermediate storage and continuing processing with different
pipelines). Thus, I have to (de)serialize to and from XML; not the fastest solution,
but I rely on its flexibility.
Now I have noticed that I would like to be able to store not whole CASes but only some annotations,
in a way that lets me easily add them back later to a CAS containing the corresponding text.
So you could build a storage which holds, independently of each other, the original text
and the corresponding meta data.

The idea would then be to assemble the original text data together with only some
specified annotations. Some people will need syntactic analysis, others are only interested
in a particular type of named entity etc. This would save space and probably also time and
would be a lot more flexible than what I currently do. Currently I have to formulate a linear
pipeline although there are multiple annotation types (named entities) which are perfectly independent
of each other. Thus, at later stages of the pipeline, the CAS data becomes huge because of
all the meta data involved, most of which is not even required for the processing
step. Why should I bother to load tons of named entity types when I actually only need the token
annotations to recognize another entity type?
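
A minimal sketch of this idea, assuming a plain in-memory store instead of Postgres and
simple offset-based records instead of real CAS feature structures. The names StandoffStore,
Annotation, put_text, add_annotations and assemble are hypothetical and not part of UIMA;
the point is only that the text is stored once and each annotation type can be loaded on its own.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type_name: str  # hypothetical type label, e.g. "Token" or "Gene"
    begin: int      # character offset into the document text (inclusive)
    end: int        # character offset into the document text (exclusive)


class StandoffStore:
    """In-memory stand-in for a database; text and annotations are kept apart."""

    def __init__(self):
        self._texts = {}
        # doc_id -> type_name -> list of annotations
        self._annotations = defaultdict(lambda: defaultdict(list))

    def put_text(self, doc_id, text):
        self._texts[doc_id] = text

    def add_annotations(self, doc_id, annotations):
        text = self._texts[doc_id]
        for ann in annotations:
            # Reject offsets that do not fit the stored text.
            if not 0 <= ann.begin <= ann.end <= len(text):
                raise ValueError(f"offsets out of range: {ann}")
            self._annotations[doc_id][ann.type_name].append(ann)

    def assemble(self, doc_id, type_names):
        """Return the text plus only the requested annotation types."""
        text = self._texts[doc_id]
        by_type = self._annotations.get(doc_id, {})
        selected = [a for t in type_names for a in by_type.get(t, [])]
        selected.sort(key=lambda a: (a.begin, a.end))
        return text, selected


store = StandoffStore()
store.put_text("doc1", "BRCA1 binds RAD51")
store.add_annotations("doc1", [
    Annotation("Token", 0, 5),
    Annotation("Token", 6, 11),
    Annotation("Token", 12, 17),
    Annotation("Gene", 0, 5),
    Annotation("Gene", 12, 17),
])

# A later pipeline step that only needs tokens never loads the gene annotations.
text, tokens_only = store.assemble("doc1", ["Token"])
```

In a real setup the two dictionaries would be two tables (or collections) keyed by document id,
and assemble would add the selected annotations to a fresh CAS holding the text.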

Just my quick thoughts on this topic :-)

Best regards,

	Erik

On 08.01.2013 at 20:37, Neal R Lewis <nrlewis@us.ibm.com> wrote:

> 
> 
> Hello All, and Happy New Year!
> 
> We've been working on our own  CAS Store for persisting CASes for our
> analytics platform.  There has been interest in this topic recently,
> specifically :
> 
> http://article.gmane.org/gmane.comp.apache.uima.devel/15292
> 
> Renaud discussed a module using MongoDB as a CAS Store:
> 
> http://article.gmane.org/gmane.comp.apache.uima.devel/15429
> 
> From what I've seen in the UIMA Oasis Spec Version 1.0, there isn't any
> discussion as to what would be a standard CAS Store.  If someone has more
> information on a UIMA backed store, please let me know.
> 
> Given  this interest, I was curious to ask the dev community:
> 
> What would you like to see in a CAS Store?  What kind of requirements have
> you had in your experience with UIMA, with respect to a CAS Store?
> 
> As was mentioned in the above threads, the impetus for a store seems to be
> the need for a way to store CASes that will be used later by a different
> analytic pipeline while still maintaining all CAS information.
> 
> Below is a list of requirements that I have gleaned from this board and my
> own experiences.  Please add or comment on what you think would be the most
> useful.  Please note that I'm not necessarily concerned with implementation
> (e.g., SQL vs NoSQL) at this time.
> 
>     1. Persist new CASes to the store
>     2. Query the store for a single CAS or a group of CASes
>     3. Query the store for a fragment  of a CAS (e.g., a sofa, view, or
> result)
>     4. Update stored CASes with new results from Analysis Operations -
> possibly the delta only
>     5. Provenance - This is one of our requirements: the ids of the
> CASes are maintained so as to provide evidence for our annotators after
> they've run on downstream analytics.
>     6. Universal identifiers for CASes.
> 
> 
> I can go into more detail about the above, if anyone is interested.
> 
> Please let me know your thoughts!
> 
> Thanks!
> 
> 
> Neal Lewis

