uima-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: Requirements / Wish List for CAS Store?
Date Thu, 10 Jan 2013 10:11:21 GMT

Sorry, I am a bit late here. Anyway, over at OpenNLP we built a Corpus
Server which basically stores CASes in a Derby database to make them
available to annotators and our training tooling.

You can have a look, it's here:

In my opinion there are a few important requirements a CAS store must
meet to be suitable for hosting training data.

- It should support document collections
- Possibility to query for CASes in a collection which have already been 
- Ability to search for text or features inside the store (e.g. to 
correct a certain annotation problem)

The Corpus Server has a REST interface which our tooling uses to access
the CASes inside it. We also use the REST interface to make a corpus
accessible to the Cas Editor: a user can search for documents, and
clicking on one opens it in the Cas Editor.
It would probably be nice to have something like this, and additionally
a web-based version.

I would not build a CAS store around an API like JDBC, because many
people would probably like to use the same Java and/or REST interface
(or some other remote interface) with totally different types of stores.
For storing small amounts of data Derby might be a good solution, but
for storing huge amounts of data you might want to use something like
HBase. And yet another group of people already has a store and would
like to implement their own bridge to the CAS store.

The nice thing about having a defined API for a CAS store is that many 
tools can be programmed against it.
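
To make the idea concrete, here is a minimal sketch of what such a
backend-neutral API could look like. It is written in Python only for
brevity; every name in it (CasStore, InMemoryCasStore, count_matches) is
hypothetical and is not the Corpus Server or UIMA API. CASes are treated
as opaque serialized bytes plus their document text, and the in-memory
backend merely stands in for Derby, HBase, or a custom bridge.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class CasStore(ABC):
    """Hypothetical backend-neutral CAS store API (an assumption, not UIMA's)."""

    @abstractmethod
    def put(self, collection: str, cas_id: str, text: str, serialized: bytes) -> None:
        """Store or overwrite one CAS in a named collection."""

    @abstractmethod
    def get(self, collection: str, cas_id: str) -> bytes:
        """Return the serialized CAS."""

    @abstractmethod
    def list_ids(self, collection: str) -> List[str]:
        """Return the ids of all CASes in a collection."""

    @abstractmethod
    def search_text(self, collection: str, query: str) -> List[str]:
        """Return ids of CASes whose document text contains the query."""


class InMemoryCasStore(CasStore):
    """Toy backend; a Derby or HBase backend would implement the same methods."""

    def __init__(self) -> None:
        # collection name -> cas id -> (document text, serialized CAS)
        self._collections: Dict[str, Dict[str, Tuple[str, bytes]]] = {}

    def put(self, collection: str, cas_id: str, text: str, serialized: bytes) -> None:
        self._collections.setdefault(collection, {})[cas_id] = (text, serialized)

    def get(self, collection: str, cas_id: str) -> bytes:
        return self._collections[collection][cas_id][1]

    def list_ids(self, collection: str) -> List[str]:
        return sorted(self._collections.get(collection, {}))

    def search_text(self, collection: str, query: str) -> List[str]:
        docs = self._collections.get(collection, {})
        return sorted(cid for cid, (text, _) in docs.items() if query in text)


def count_matches(store: CasStore, collection: str, query: str) -> int:
    """Example tool: depends only on the CasStore API, not on a backend."""
    return len(store.search_text(collection, query))
```

A tool such as count_matches never learns which backend it is talking
to, which is the point of programming against a defined API.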

In my experience XMI is not a good format for storing CAS data in a
database, because you always need to rewrite everything when even two
bytes of a single FS change. The same is true for reading data from the
CAS: if you just want to get the text, you need to read the entire XMI.
Maybe the best way to solve this is to define a new CAS serialization
format which is better suited for storing in a database.
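
As a rough illustration of the difference (not a proposal for the actual
format), the sketch below keeps the document text and each annotation in
separate rows, so the text can be read on its own and a single
annotation can be updated in place. It uses Python's sqlite3 module
purely as a stand-in database; the table layout, column names, and
function names are all assumptions made for this example.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the hypothetical two-table layout: text and annotations apart."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS document (
            cas_id TEXT PRIMARY KEY,
            text   TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS annotation (
            cas_id       TEXT    NOT NULL REFERENCES document(cas_id),
            fs_id        INTEGER NOT NULL,
            type_name    TEXT    NOT NULL,
            begin_offset INTEGER NOT NULL,
            end_offset   INTEGER NOT NULL,
            PRIMARY KEY (cas_id, fs_id)
        );
        """
    )
    return conn


def add_document(conn, cas_id, text):
    conn.execute("INSERT INTO document (cas_id, text) VALUES (?, ?)", (cas_id, text))
    conn.commit()


def add_annotation(conn, cas_id, fs_id, type_name, begin, end):
    conn.execute(
        "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
        (cas_id, fs_id, type_name, begin, end),
    )
    conn.commit()


def read_text(conn, cas_id):
    """Fetch only the document text; no annotation rows are touched."""
    row = conn.execute(
        "SELECT text FROM document WHERE cas_id = ?", (cas_id,)
    ).fetchone()
    return row[0]


def update_annotation_span(conn, cas_id, fs_id, begin, end):
    """Rewrite one annotation row; returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE annotation SET begin_offset = ?, end_offset = ? "
        "WHERE cas_id = ? AND fs_id = ?",
        (begin, end, cas_id, fs_id),
    )
    conn.commit()
    return cur.rowcount


def covered_text(conn, cas_id, fs_id):
    """Return the text span covered by one annotation."""
    text, begin, end = conn.execute(
        "SELECT d.text, a.begin_offset, a.end_offset "
        "FROM annotation a JOIN document d ON d.cas_id = a.cas_id "
        "WHERE a.cas_id = ? AND a.fs_id = ?",
        (cas_id, fs_id),
    ).fetchone()
    return text[begin:end]
```

With XMI stored as one blob, both read_text and update_annotation_span
would have to parse or rewrite the whole document instead.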


On 01/08/2013 08:37 PM, Neal R Lewis wrote:
> Hello All, and Happy New Year!
> We've been working on our own  CAS Store for persisting CASes for our
> analytics platform.  There has been interest in this topic recently,
> specifically :
> http://article.gmane.org/gmane.comp.apache.uima.devel/15292
> Renaud discussed a module using MongoDB about a CAS Store:
> http://article.gmane.org/gmane.comp.apache.uima.devel/15429
> From what I've seen in the UIMA Oasis Spec Version 1.0, there isn't any
> discussion as to what would be a standard CAS Store.  If someone has more
> information on a UIMA backed store, please let me know.
> Given  this interest, I was curious to ask the dev community:
> What would you like to see in a CAS Store?  What kind of requirements have
> you had in your experience with UIMA, with respect to a CAS Store?
> As was mentioned in the above threads, the impetus for a store seems to be
> the need for a way to store CASes that will be used later by a different
> analytic pipeline while still maintaining all CAS information.
> Below is a list of requirements that I have gleaned from this board and my
> own experiences.  Please add or comment on what you think would be the most
> useful.  Please note that I'm not necessarily concerned with implementation
> (e.g., SQL vs NoSQL) at this time.
>      1. Persist new CASes to the store
>      2. Query the store for a single CAS or a group of CASes
>      3. Query the store for a fragment  of a CAS (e.g., a sofa, view, or
> result)
>      4. Update stored CASes with new results from Analysis Operations -
> possibly the delta only
>      5. Provenance - This is one of our requirements where the ids of the
> CASes are maintained so as to provide evidence for our annotators after
> they've run on down stream analytics.
>      6. Universal identifiers for CASes.
> I can go into more detail about the above, if anyone is interested.
> Please let me know your thoughts!
> Thanks!
> Neal Lewis
