uima-dev mailing list archives

From Peter Klügl <pklu...@uni-wuerzburg.de>
Subject Re: TextMarker
Date Mon, 03 Jan 2011 15:00:03 GMT
Hi Thilo,

On 01.01.2011 13:41, Thilo Goetz wrote:
> Hi Peter,
>
> I downloaded the source trunk and got things mostly to compile
> and run: I'm running Eclipse 3.5.2, RCP edition, and installed
> the latest UIMA plugins and DLTK 1.0.2.  I also had to find the
> Mozilla xpcom plugin.  The only things not compiling for me are
> references to com.sun.org.apache.apache.xpath.XPathAPI.  The
> internet tells me that those could be fixed by using Xalan
> directly, but I haven't tried.
>

The XPCom plugin is only necessary for the HTML visualization of the CEV 
plugin. The XULRunner plugin provides the implementations of the 
interfaces for the manipulation of the DOM within Eclipse. Both plugins 
often cause problems, but I haven't found a better solution yet.

About the XML problem: Which plugin has that reference? I had a 
similar problem about three years ago, but that should be solved. 
However, I'm not an expert on the different XML integrations in Java. 
The only place in my code, if I'm not mistaken, where XML is actively 
used, is the engine project that is able to load dictionaries in 
trie-like structures. But that should work just fine without additional 
libraries. Can you give me more information about that problem?
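
If the reference turns out to be the JDK-internal XPathAPI class, one 
alternative to using Xalan directly might be the standard javax.xml.xpath 
API that ships with the JDK. The following is only a minimal sketch under 
that assumption: the class XPathSketch, the method selectTexts and the 
sample XML are hypothetical and not taken from the TextMarker sources.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the TextMarker code base.
public class XPathSketch {

    // Evaluates an XPath expression against an XML string and returns
    // the text content of every matching node, in document order.
    public static List<String> selectTexts(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document, only to show the call.
        String xml = "<dict><entry>foo</entry><entry>bar</entry></dict>";
        System.out.println(selectTexts(xml, "/dict/entry"));
    }
}
```

This relies only on classes from the JDK, so no additional library would 
be needed on the classpath.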

> My main issue right now is that the TextMarker wiki is down,
> and that seems to be the only source of documentation (unless
> I missed something).

I'm sorry about that. My colleagues moved the wiki to a new server that 
is not as stable as expected. We will fix that ASAP. The wiki is still 
the only bit of documentation that currently exists.

>
> I noticed that TextMarker uses a lot of 3rd party libraries.
> So we'll need to compile an exhaustive list of the libs
> that are being used, their licenses and provenance, and in
> case the license is bad, possible alternatives.
>

I'm willing to reduce the use of, or replace, any 3rd party library 
where possible.

The most important dependencies are the UIMA-runtime plugin, the 
Eclipse-plugins (core, ui...), the plugins of the DLTK-Core framework 
and ANTLR (used for the AST in the IDE and for interpreting the rules in 
the analysis engines). The optional HTML extension of the CEV plugin 
uses an HTML parser in addition to the XPCom dependency.

Some plugins were hosted on SourceForge only for historical reasons; 
they are not part of the TextMarker system. I have removed them now:
de.uniwue.tm.cas.converter
de.uniwue.tm.old.OfficeConverter
de.uniwue.tm.textmarker.uutuc


Peter


> --Thilo
>
> On 12/14/2010 15:55, Peter Klügl wrote:
>> Hello,
>>
>> We would like to contribute our TextMarker system to Apache UIMA and
>> want to ask whether the development team is interested in this contribution.
>> The system is currently hosted on SourceForge
>> (http://sourceforge.net/projects/textmarker/) and there is some
>> documentation in the project wiki
>> (http://tmwiki.informatik.uni-wuerzburg.de/).
>>
>> I think a good start for that discussion is to summarize the
>> current status of the system. TextMarker is an Eclipse-based tool
>> implemented in pure Java that can among other things be used to
>> prototype analysis engines or develop complex handcrafted text
>> processing applications. It consists of four major parts:
>>
>> Language:
>> The rule (or rather script) language can be compared to regular
>> expressions over annotations with additional conditions and actions.
>> There are currently 28 different conditions and 34 actions. The
>> conditions range from a test on a feature value to a test of whether
>> the matched annotation is contained in another annotation of a given
>> type; the actions range from creating an annotation to applying an
>> external dictionary or analysis
>> engine. A TextMarker script can import type systems or define new types
>> or variables. Then, there are also some more complex control structures
>> for procedure calls, conditioned statements or recursion. The TextMarker
>> language (and inference) is in active use in some production
>> applications here, but it lacks test cases. However, we are currently
>> writing uimaFIT-based component tests to improve the quality management.
>>
>> Workbench:
>> The Eclipse-based tool for developing the TextMarker scripts is
>> currently based on DLTK 1.0 (http://www.eclipse.org/dltk/) and its
>> editor supports syntax highlighting, syntax checks, context-sensitive
>> auto-completion, formatting, mark occurrences, open declaration and some
>> other useful stuff commonly known in IDEs. For each script file, a type
>> system and an executable analysis engine are created. Therefore, it's
>> quite simple and efficient to create an analysis engine with a few lines
>> of TextMarker rules. The workbench supports testing on annotated xmiCas
>> while writing new rules and provides some minimal debugging
>> functionality that explains why and on what text a rule was executed.
>>
>> CEV:
>> This plugin can be used to edit or visualize xmiCAS and is also able to
>> render HTML. It is heavily used by the testing and explanation 
>> components.
>>
>> TextRuler:
>> This framework for rule learning is rather a playground and mainly
>> implemented by students. There are currently more or less working
>> implementations of LP2, WHISK, WIEN, RAPIER and an algorithm of our
>> own, and three other algorithms are being implemented.
>>
>>
>> Overall, the system has been running stably for a year now, but it is
>> lacking in code quality, documentation and test cases. We are also
>> willing to change the name of the system, if someone can think of a
>> better one.
>>
>> I'm looking forward to your comments.
>>
>> Best regards,
>>
>> Peter
>>
>>

