uima-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: Document collections [was: Re: Building the eclipse update site]
Date Thu, 30 Jul 2009 08:59:13 GMT
Thilo Goetz wrote:
> Jörn Kottmann wrote:
>   
>> Thilo Goetz wrote:
>>     
>>> Jörn Kottmann wrote:
>>>  
>>>       
>>>> Jörn Kottmann wrote:
>>>>    
>>>>         
>>>>>> A collection of text documents that you can run
>>>>>> analysis on.  If I understand correctly, the Cas
>>>>>> Editor currently requires XCAS/XmiCAS files.  It
>>>>>> would be nice if users could just add their text
>>>>>> files and then either create annotations manually
>>>>>> with the Cas Editor, or automatically by running
>>>>>> some analysis and then view the results using the
>>>>>> Cas Editor.  Then we could add results comparison
>>>>>> etc.  See
>>>>>> http://dl.alphaworks.ibm.com/technologies/tap/text_analysis_perspective.pdf
>>>>>>
>>>>>>
>>>>>> for an (outdated) description of what we have
>>>>>> in-house.  It's geared more towards a business user
>>>>>> than a developer, but the ideas of document collections
>>>>>> and the development cycle are equally applicable.
>>>>>> If there was enough interest here, I think that
>>>>>> would be a good direction to go in.
>>>>>>           
>>>>>>             
>>>>> Yes for me it sounds like the right way.
>>>>> We could also use it for debugging an AE, then
>>>>> a user defines a debug configuration and adds
>>>>> the collection as document source.
>>>>>       
>>>>>           
>>>> How would you define the format of a document collection?
>>>>
>>>> To open a CAS document, the document itself and a type system
>>>> for the document are needed.
>>>>
>>>> In the Cas Editor right now, an Input Collection is a Corpus folder that
>>>> contains xmi/xcas files in one directory. Together with the project type
>>>> system, the files can be loaded by UIMA. It has been criticized, though,
>>>> for not allowing subdirectories for structuring its documents.
>>>>
>>>> Jörn
>>>>     
>>>>         
>>> That's perfectly fine, we do this in a similar way.
>>> What would be good though is to distinguish between
>>> text documents and "CAS documents" (be they XCAS, XMI
>>> or some other format).  So you could start your work
>>> by importing some text documents, then annotate them
>>> in various ways (manually, or with coded annotators).
>>> The CASes would reside in a different folder, and you
>>> could derive any number of CAS collections from the
>>> same set of source text documents.  We find that way
>>> of working very convenient.
>>>       
>> We could reuse the code which is in the Cas Editor right
>> now and move it into a new plugin which provides the document
>> collections and type system to other plugins.
>>
>> The Cas Editor should be independent of the project model because
>> people who use the Cas Editor do not necessarily want to use it.
>>     
>
> +1, couldn't agree more.  In fact, I would like to integrate
> the CAS editor into our tooling; that would be a good test
> case of how independent it is.  I don't know when I'll get around
> to playing with that, but it's definitely on my to-do list.
>   
OK, then let's split the Cas Editor into the editing part and the project
model. For the project model part we have to create a new Eclipse
project, e.g. uimaj-ep-base. The remaining Cas Editor should be independent
of the project model, which means uimaj-ep-base depends on the Cas Editor
(to add the document provider extension to it).
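
To make the dependency direction concrete, here is a rough sketch in plain Java. It is only an illustration under assumptions: the names DocumentProvider, CorpusFolderProvider and ProjectModelSketch are hypothetical and do not exist in the Cas Editor, and a real implementation would be wired up through an Eclipse extension point rather than a direct constructor call. The editor side only knows the interface, while the project model side (what would live in uimaj-ep-base) implements it for a flat corpus folder of xmi/xcas files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ProjectModelSketch {

    // Editor side (hypothetical): the only contract the editor would see.
    // It knows nothing about projects or corpus folders.
    interface DocumentProvider {
        List<Path> listDocuments() throws IOException;
    }

    // Project model side (hypothetical): a corpus folder as described in
    // this thread, i.e. one flat directory holding xmi/xcas files.
    static final class CorpusFolderProvider implements DocumentProvider {
        private final Path corpusFolder;

        CorpusFolderProvider(Path corpusFolder) {
            this.corpusFolder = corpusFolder;
        }

        @Override
        public List<Path> listDocuments() throws IOException {
            // Files.list is not recursive, so subdirectories are ignored.
            try (Stream<Path> entries = Files.list(corpusFolder)) {
                return entries
                        .filter(Files::isRegularFile)
                        .filter(CorpusFolderProvider::isCasFile)
                        .sorted()
                        .collect(Collectors.toList());
            }
        }

        private static boolean isCasFile(Path file) {
            String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
            return name.endsWith(".xmi") || name.endsWith(".xcas");
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway corpus folder with two CAS files and one text file.
        Path corpus = Files.createTempDirectory("corpus");
        Files.createFile(corpus.resolve("a.xmi"));
        Files.createFile(corpus.resolve("b.xcas"));
        Files.createFile(corpus.resolve("notes.txt"));

        DocumentProvider provider = new CorpusFolderProvider(corpus);
        for (Path document : provider.listDocuments()) {
            System.out.println(document.getFileName());
        }
    }
}
```

The point of the interface is that the editor could be embedded in other tooling with a different provider, without pulling in the project model.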

After we are done with that, we can look into uimaj-ep-base and see how
it fits our needs and how it can be used by the other Eclipse-based tooling.

Jörn
