uima-dev mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: Document collections [was: Re: Building the eclipse update site]
Date Thu, 30 Jul 2009 09:28:14 GMT
Jörn Kottmann wrote:
> Thilo Goetz wrote:
>> Jörn Kottmann wrote:
>>> Thilo Goetz wrote:
>>>> Jörn Kottmann wrote:
>>>>> Jörn Kottmann wrote:
>>>>>>> A collection of text documents that you can run
>>>>>>> analysis on.  If I understand correctly, the Cas
>>>>>>> Editor currently requires XCAS/XmiCAS files.  It
>>>>>>> would be nice if users could just add their text
>>>>>>> files and then either create annotations manually
>>>>>>> with the Cas Editor, or automatically by running
>>>>>>> some analysis and then view the results using the
>>>>>>> Cas Editor.  Then we could add results comparison
>>>>>>> etc.  See
>>>>>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>>>>> for an (outdated) description of what we have
>>>>>>> in-house.  It's geared more towards a business user
>>>>>>> than a developer, but the ideas of document collections
>>>>>>> and the development cycle are equally applicable.
>>>>>>> If there was enough interest here, I think that
>>>>>>> would be a good direction to go in.
>>>>>> Yes, to me that sounds like the right way.
>>>>>> We could also use it for debugging an AE: a user
>>>>>> defines a debug configuration and adds the
>>>>>> collection as the document source.
>>>>> How would you define the format of a document collection?
>>>>> To open a CAS document, the document itself and a type system
>>>>> for the document are needed.
>>>>> In the Cas Editor right now, an Input Collection is a Corpus folder
>>>>> that contains xmi/xcas files in one directory; together with the
>>>>> project type system, the files can be loaded by UIMA. It has been
>>>>> criticized, though, for not allowing subdirectories for structuring
>>>>> its documents.
>>>>> Jörn
>>>> That's perfectly fine, we do this in a similar way.
>>>> What would be good though is to distinguish between
>>>> text documents and "CAS documents" (be they XCAS, XMI
>>>> or some other format).  So you could start your work
>>>> by importing some text documents, then annotate them
>>>> in various ways (manually, or with coded annotators).
>>>> The CASes would reside in a different folder, and you
>>>> could derive any number of CAS collections from the
>>>> same set of source text documents.  We find that way
>>>> of working very convenient.
>>> We could reuse the code which is in the Cas Editor right
>>> now and move it into a new plugin which provides the document
>>> collections and type system to other plugins.
>>> The Cas Editor should be independent of the project model because
>>> people who use the Cas Editor do not necessarily want to use it.
>> +1, couldn't agree more.  In fact, I would like to integrate
>> the CAS editor into our tooling; that would be a good test
>> case for how independent it is.  I don't know when I'll get
>> around to playing with that, but it's definitely on my to-do list.
> OK, then let's split the Cas Editor into the editing part and the project
> model. For the project model part we have to create a new Eclipse
> project, e.g. uimaj-ep-base. The remaining Cas Editor should be independent
> of the project model, which means uimaj-ep-base depends on the Cas Editor
> (to add the document provider extension to it).
> After we are done with that, we can look into uimaj-ep-base and see how
> it fits our needs and how it can be used by the other Eclipse-based tooling.
> Jörn

+1, that would be great.  I would like to be able to use
the CAS Editor sometimes not to edit CASes, but to simply
display them.  For example, you could imagine some Eclipse
tooling that runs UIMA analysis and displays the results
in the CAS Editor, without first materializing the CAS on
disk.  So with your proposed changes, I should be able to
do that, right?
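
To make the question concrete, here is a minimal sketch in plain Java of
display code that depends only on a small document-source abstraction, so
the same code can show a file from a corpus folder or a result that only
exists in memory.  All names (CasViewSketch, DocumentSource, FileSource,
InMemorySource, display) are hypothetical and are not the Cas Editor's or
UIMA's actual API, and the "CAS" is reduced to a plain string purely for
illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CasViewSketch {

    /** Hypothetical stand-in for whatever the editor needs to display a document. */
    interface DocumentSource {
        String name();
        String content() throws IOException;
    }

    /** Source backed by a file on disk, as in a corpus-folder model. */
    static final class FileSource implements DocumentSource {
        private final Path file;

        FileSource(Path file) {
            this.file = file;
        }

        public String name() {
            return file.getFileName().toString();
        }

        public String content() throws IOException {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        }
    }

    /** Source that holds an analysis result in memory only. */
    static final class InMemorySource implements DocumentSource {
        private final String name;
        private final String content;

        InMemorySource(String name, String content) {
            this.name = name;
            this.content = content;
        }

        public String name() {
            return name;
        }

        public String content() {
            return content;
        }
    }

    /** Stand-in for the editor: it knows DocumentSource, not the project model. */
    static String display(DocumentSource source) throws IOException {
        return source.name() + ": " + source.content();
    }

    public static void main(String[] args) throws IOException {
        // Result of some analysis run, never written to disk.
        DocumentSource inMemory =
                new InMemorySource("analysis-result", "text plus annotations");
        System.out.println(display(inMemory));

        // The same display code also serves a file-backed document.
        Path tmp = Files.createTempFile("corpus-doc", ".txt");
        Files.write(tmp, "text from the corpus folder".getBytes(StandardCharsets.UTF_8));
        System.out.println(display(new FileSource(tmp)));
        Files.delete(tmp);
    }
}
```

The point of the sketch is only the dependency direction: the display code
sees an interface, and the project model (or any other tooling) supplies
the implementation.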