uima-dev mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: Document collections [was: Re: Building the eclipse update site]
Date Fri, 24 Jul 2009 11:29:21 GMT
Jörn Kottmann wrote:
> Thilo Goetz wrote:
>> Jörn Kottmann wrote:
>>> Jörn Kottmann wrote:
>>>>> A collection of text documents that you can run
>>>>> analysis on.  If I understand correctly, the Cas
>>>>> Editor currently requires XCAS/XmiCAS files.  It
>>>>> would be nice if users could just add their text
>>>>> files and then either create annotations manually
>>>>> with the Cas Editor, or automatically by running
>>>>> some analysis and then view the results using the
>>>>> Cas Editor.  Then we could add results comparison
>>>>> etc.  See
>>>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>>> for an (outdated) description of what we have
>>>>> in-house.  It's geared more towards a business user
>>>>> than a developer, but the ideas of document collections
>>>>> and the development cycle are equally applicable.
>>>>> If there was enough interest here, I think that
>>>>> would be a good direction to go in.
>>>> Yes, to me that sounds like the right way.
>>>> We could also use it for debugging an AE: a user
>>>> defines a debug configuration and adds the
>>>> collection as the document source.
>>> How would you define the format of a document collection?
>>> To open a CAS document, both the document itself and a type
>>> system for the document are needed.
>>> In the Cas Editor right now, an Input Collection is a Corpus folder
>>> that contains xmi/xcas files in one directory. Together with the
>>> project type system, the files can be loaded by UIMA. It has been
>>> criticized, though, for not allowing subdirectories for structuring
>>> its documents.
>>> Jörn
>> That's perfectly fine, we do this in a similar way.
>> What would be good though is to distinguish between
>> text documents and "CAS documents" (be they XCAS, XMI
>> or some other format).  So you could start your work
>> by importing some text documents, then annotate them
>> in various ways (manually, or with coded annotators).
>> The CASes would reside in a different folder, and you
>> could derive any number of CAS collections from the
>> same set of source text documents.  We find that way
>> of working very convenient.
> We could reuse the code which is in the Cas Editor right
> now and move it into a new plugin which provides the document
> collections and type system to other plugins.
> The Cas Editor should be independent of the project model because
> people who use the Cas Editor do not necessarily want to use it.

+1, couldn't agree more.  In fact, I would like to integrate
the CAS editor into our tooling; that would be a good test
of how independent it is.  I don't know when I'll get around
to playing with that, but it's definitely on my to-do list.


> Jörn
