uima-dev mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Document collections [was: Re: Building the eclipse update site]
Date Fri, 24 Jul 2009 08:22:16 GMT
Jörn Kottmann wrote:
> Jörn Kottmann wrote:
>>> A collection of text documents that you can run
>>> analysis on.  If I understand correctly, the Cas
>>> Editor currently requires XCAS/XmiCAS files.  It
>>> would be nice if users could just add their text
>>> files and then either create annotations manually
>>> with the Cas Editor, or automatically by running
>>> some analysis and then view the results using the
>>> Cas Editor.  Then we could add results comparison
>>> etc.  See
>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>> for an (outdated) description of what we have
>>> in-house.  It's geared more towards a business user
>>> than a developer, but the ideas of document collections
>>> and the development cycle are equally applicable.
>>> If there was enough interest here, I think that
>>> would be a good direction to go in.
>> Yes, to me that sounds like the right way.
>> We could also use it for debugging an AE: a user
>> defines a debug configuration and adds the
>> collection as the document source.
> How would you define the format of a document collection?
> To open a CAS document, both the document itself and a type system
> for the document are needed.
> In the Cas Editor right now, an Input Collection is a Corpus folder
> which contains xmi/xcas files in one directory; together with the
> project type system, the files can be loaded by UIMA. It has been
> criticized, though, for not allowing subdirectories for structuring
> its documents.
> Jörn

That's perfectly fine, we do this in a similar way.
What would be good though is to distinguish between
text documents and "CAS documents" (be they XCAS, XMI
or some other format).  So you could start your work
by importing some text documents, then annotate them
in various ways (manually, or with coded annotators).
The CASes would reside in a different folder, and you
could derive any number of CAS collections from the
same set of source text documents.  We find that way
of working very convenient.
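
To make the layout concrete, here is a minimal sketch of how
a source text file could be mapped to its counterpart in one
of several derived CAS collections.  Everything in it is an
assumption for illustration only: the class name
CasCollectionLayout, the "text" and "cas" folder names, and
the .xmi extension are hypothetical and not what the Cas
Editor or our in-house tooling actually uses.  It relies on
java.nio only and mirrors subdirectories of the text folder.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: maps source text files to files in derived CAS collections. */
public final class CasCollectionLayout {

    private CasCollectionLayout() {}

    /**
     * Returns the CAS file path for a source text document in a named CAS collection.
     * Assumed layout (not a UIMA convention):
     *   <project>/text/<sub/dirs>/<doc>.txt
     *   <project>/cas/<collection>/<sub/dirs>/<doc>.xmi
     */
    public static Path casPathFor(Path projectRoot, String collectionName, Path textFile) {
        Path textRoot = projectRoot.resolve("text");
        if (!textFile.startsWith(textRoot)) {
            throw new IllegalArgumentException(textFile + " is not under " + textRoot);
        }
        // Keep the path relative to the text folder so subdirectories are mirrored.
        Path relative = textRoot.relativize(textFile);
        String name = relative.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        // Each CAS collection gets its own folder, separate from the source text.
        Path casRoot = projectRoot.resolve("cas").resolve(collectionName);
        return casRoot.resolve(relative).resolveSibling(base + ".xmi");
    }
}
```

With this layout, the same text file can appear in any
number of collections, for example one named "manual" for
hand-made annotations and one per annotator run, without
the source documents ever being modified.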

