uima-dev mailing list archives

From Richard Eckart de Castilho <richard.eck...@gmail.com>
Subject Re: [jira] [Commented] (UIMA-2953) jcasgen-maven-plugin needs to support patterns
Date Tue, 02 Jul 2013 16:17:35 GMT
Following your recommendation on doing discussions on the list (not in Jira) and because I
think inline comments are appropriate here, I'll reply via the list.

Am 02.07.2013 um 17:22 schrieb Marshall Schor (JIRA) <dev@uima.apache.org>:

> Yes, I think this is to be something of a mutual education :-) - having not used uimaFIT
myself :-).
> First some minor editing:  I assume you mean in Case 1: …. one or more dedicated Maven
modules ... to be ... one or more dedicated UIMA modules ... ?  

What is a UIMA module? Is that an Eclipse project which is not a Maven module? I wrote "Maven
module" because the jcasgen-maven-plugin doesn't make sense on a non-Maven Eclipse project.

> Second, apologies for this long ramble...
> Some clarifications: 
>  * JCasGen always does a complete import of all things (AE descriptors, type system descriptors,
etc.) and uses core UIMA to form the merged type system.  All UIMA-1176 does is limit which
cover classes get generated from all of the defined types.

That is my understanding as well.

>  * I was thinking of the module as typically being an Eclipse "Project" which contained
both annotator implementation Java code, as well as an Analysis Engine XML Descriptor for
that which specified the externalized meta-data for the Annotator: the type system, the inputs/
outputs, the parameters, any special indexes, resources, etc.

Ok. That's probably the simplest case. Haven't seen that for a long time though ;)

> I was thinking there are 2 use cases - the uimaFIT style, where descriptors are not used
(?) and the UIMA style, using XML descriptors.  In the latter, type systems are put together
from collections of specifications.  These collections are done using imports, which occur
in two places (for type systems): within type system descriptions themselves, and also within
aggregate analysis descriptors, where the delegates are typically specified using imports.

uimaFIT uses descriptors for type systems. It also uses descriptors for pretty much everything
else, but it does so internally, mostly hiding the fact from the user. The descriptors are
dynamically generated and never serialized as XML (unless the user really wants that).

Regarding type systems, uimaFIT can automatically scan the classpath for type system
descriptors, internally merge them and provide them to any components created via uimaFIT.
It is not mandatory to do this, but extremely convenient. If used, this mechanism replaces
the explicit imports usually present in component descriptors.

> So the spec of what things go together is provided in the XML descriptors, via the import
statements.  If this exists, that's what I think the ant-like patterns would be replicating.
 If this doesn't exist (here I'm guessing - perhaps because the uimaFIT style isn't using
these descriptors), the ant-like pattern isn't replicating this.  

I still don't understand the connection between the ant-like patterns in the jcasgen-maven-plugin
(build time) and the import mechanism (runtime). Maybe you are trying to tell me that in a
plain-UIMA case, there is no need for the patterns, because there is typically a top-level
descriptor. Maybe, consequently, if the uimaFIT type-discovery mechanism already has a pattern
somewhere (e.g. in META-INF/org.apache.uima.fit/types.txt), then these patterns should be
used directly by the plugin instead of duplicating this feature with the ant-like patterns.
I'd still argue that the "types.txt" is quite different from the ant-like pattern in the jcasgen-maven-plugin,
because types.txt uses classpath-based patterns (which could actually leak across module boundaries),
while jcasgen-maven-plugin uses build-time paths.
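
To make the "build-time paths" side of that distinction concrete, here is a minimal sketch in
plain Java. It is not the plugin's code: the class and method names are hypothetical, and it
uses the JDK's glob matcher only as an approximation of ant-like patterns. It assumes
'/'-separated paths relative to the module's base directory.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of jcasgen-maven-plugin or uimaFIT.
public class BuildTimePatternSketch {

    // Returns the module-relative paths that match the given glob pattern.
    // Only paths handed in by the caller are considered, so nothing from
    // other modules (or from the classpath) can leak into the result.
    public static List<String> matchDescriptors(List<String> relativePaths, String glob) {
        // Note: the JDK glob syntax is close to, but not identical with, Ant patterns.
        // For example, "**/" here does not match zero directories as it does in Ant.
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matches = new ArrayList<>();
        for (String candidate : relativePaths) {
            if (matcher.matches(Paths.get(candidate))) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```

A classpath-based pattern, by contrast, is resolved against everything visible on the classpath
at runtime, which is why it can pick up descriptors from other modules.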

> The use case that motivated UIMA-1176 was, as you thought, where people would create an
Eclipse project, and put annotator code plus an annotator analysis engine description (which,
in turn, would include a type system, which perhaps had imports of other type systems).  The
JCasGen operation would generate the source code and put it into this Eclipse project's sources,
so that when a Jar was made for this, it would include both the annotator implementation,
and the corresponding JCas classes (excluding imports of type systems not defined within this
project).  This seemed like a modular packaging.

I like this kind of packaging. Except that we usually put our types and engines into separate
modules (see below).

> This use-case corresponds to your "Case 2" I think.  UIMA-1176 helps case 2, where the
AE module might depend on other similarly packaged Analysis Engines, and would therefore get
their JCas cover classes by depending on these other projects (which would be defining them).

> In "Case 2" - if there were multiple top level descriptors, say for 2 configurations
of a particular AE, where the Types produced were different, I can see where one might want
to specify multiple top level descriptors to jcasgen-maven-plugin.  I haven't seen or heard
about this use case myself, however; it seems to me that when people design an annotator,
they have a single idea about what it's producing, and want to provide JCas cover classes
for that.  Have you seen this?

It appears that you assume a "module" always contains only a single AE, possibly in multiple
configurations. We have multiple AEs in a (Maven) module and sometimes type descriptors in
addition to those imported from what you call "model" modules.

In the past, I have seen people defining exactly one type per descriptor. These days I more
often see a one-descriptor-per-package style, e.g. in "model" modules containing multiple packages.

What's a top-level descriptor for you? An AE descriptor or a type system descriptor? For me,
for the purpose of the jcasgen-maven-plugin, the top-level descriptor(s) are type system descriptors.

> For large projects, I've seen "Case 1" style, where a group collectively decides to put
a large collection of types into a "shared" Eclipse project, which they might call their "model".
 I can imagine that there might *not* be a single common type system spec which specified
all of the type system descriptors, and how this could create the need for the build tool
to specify multiple top descriptors.

ClearTK has one such "model" module. I believe the same is true for cTAKES. DKPro Core has
quite a lot of them, roughly one per "annotation layer", e.g. "syntax", "semantic role labelling",
"segmentation", etc. We call them "api" modules. Some of these contain more than one descriptor,
e.g. in a one-descriptor-per-package style.


-- Richard