uima-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: Guidelines for a mutual contribution
Date Tue, 31 May 2011 21:02:19 GMT
On 5/19/11 3:04 PM, Nicolas Hernandez wrote:
> Hello Everyone
> Jörn, yes it (training MaxEnt models for OpenNLP from the French
> Treebank) is actually part of our plan (building a French-Speaking
> UIMA Community). We also wanted to contribute to the OpenNLP project,
> since no models were available for French processing!

That would be really nice; we would love to get this contribution. Over
at OpenNLP we have special parsers for the different corpus formats, so maybe
we can integrate your parsing code for the French corpus there. Then we can
train our components on this data.
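To illustrate the kind of parsing code meant here, below is a minimal sketch of a corpus reader. The line format and class names are hypothetical simplifications for this example: the actual French Treebank is distributed in a richer XML format, and OpenNLP's real format readers produce samples through its ObjectStream interface rather than a plain list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sample holder; in OpenNLP the real POSSample class
// plays this role.
final class TagSample {
    final List<String> tokens = new ArrayList<>();
    final List<String> tags = new ArrayList<>();
}

public class TreebankReader {
    // Parses a simplified one-token-per-line "word<TAB>tag" format,
    // with blank lines separating sentences. This only illustrates
    // the shape of a corpus-format reader, not the real FTB format.
    public static List<TagSample> parse(List<String> lines) {
        List<TagSample> samples = new ArrayList<>();
        TagSample current = new TagSample();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // Sentence boundary: flush the current sample.
                if (!current.tokens.isEmpty()) {
                    samples.add(current);
                    current = new TagSample();
                }
                continue;
            }
            String[] parts = line.split("\t");
            current.tokens.add(parts[0]);
            current.tags.add(parts[1]);
        }
        if (!current.tokens.isEmpty()) {
            samples.add(current);
        }
        return samples;
    }
}
```

Once a reader like this emits samples in OpenNLP's expected form, the existing trainers can consume them directly, which is what makes contributing the format code valuable even without redistributing the corpus itself.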

OpenNLP also comes with a UIMA integration, which makes all components
(except coref) available to UIMA applications.

> About the right to train models on this data set and then distribute
> them under Apache License 2: it took time for us to get the right to
> do it, but I think that was because we were the first to ask. Now
> they know about it. I know that the maltparser team
> (http://maltparser.org/) would also be interested in the grant. You
> may ask the French Treebank authors. I can also ask them to put an
> explicit mention of the right to do this on their web site.
It would be nice if the data could be shared among the OpenNLP developers,
and we would of course need to distribute the model under AL 2.0, but
just having the integration code would also be nice, because it looks like
everyone can easily get a copy of that corpus.

> As far as I know, the training data sets for the English and German POS
> models are not freely available, are they?
> Finally, Jörn, I am not sure I understand. Do you think the IP
> clearance process is not suited for submitting our contribution?

In the end the big issue for UIMA developers is that we cannot retrain the
tagger without having the training data. Not being able to retrain means
that we cannot change the code much, so that is a real problem for us.
I have dealt with this kind of issue a lot over at OpenNLP, and it
is just painful.

And even if we follow your description and retrain the models, do we
then have the right to publish them under AL 2.0, or do we need to go
through IP clearance again, given that the AL 2.0 publishing is an
agreement between you and the corpus copyright holder?

I think what we need is an agreement granting the ASF the right to publish
models trained on the data under AL 2.0, and it would also be nice if every
committer in the team could get access to the data.

Anyway, hope to see you over at the OpenNLP mailing list.

