uima-dev mailing list archives

From Richard Eckart de Castilho <...@apache.org>
Subject Re: [VOTE] Release Apache UIMA Ruta 2.4.0 RC3
Date Mon, 08 Feb 2016 21:26:42 GMT
The problem I see is that we currently do not know where the file comes from
(provenance). I find it hard to believe that the file was an original creation
from Stefan. I believe that it could take quite some time to compile such a
list of names. More likely, in my opinion, the file was obtained from
some third-party source.

If we knew that third-party source, we might easily be able to clear IP.

Since we do not know it, we currently have to resort to speculation about the
lawfulness of compiling specialized unigram lists.

It looks like we can agree this is not a blocker for the present release, as
the risk involved is apparently very low. Still, we should try to clear this.

I've placed a comment on UIMA-3926 asking Stefan to shed some light on the
provenance of the file. Let's see what comes of it.

Thanks for digging up the issue number, Marshall!


-- Richard

> On 08.02.2016, at 21:56, Marshall Schor <msa@schor.com> wrote:
> So, first I'd like to summarize, in case I don't fully understand the issue.
> Ruta contains some examples; the example data include a 90K file, FirstNames.txt,
> in example-projects/GermanNovels/resources.
> From what I can see, there are no actual German novels included in
> example-projects/GermanNovels.
> From the discussion, it seems the word lists were not originally part of the
> contribution; in a comment on UIMA-3926, Peter asks whether the word lists could
> be contributed without the novels, and Stefan then contributed them.
> I am not a lawyer, so this is not a legal opinion, but I did a quick internet
> search and believe that compiling a list of words used in a novel does not
> infringe the copyright in that novel, because this data is entirely independent
> of the expressive value of any of the underlying sources that might have been
> used to compile the list; and the list has lost any similarity to the underlying
> sources in terms of things like plot, theme, etc.
> So I think the risk is low.  We could probably reduce the risk by asking Stefan
> where these lists came from, and if he is aware of any IP issues with them.
> To the extent that we collect information and form opinions on issues like this,
> I recommend adding a file to the SVN, not necessarily included in the build,
> called something like license-notice-research.txt, just to record these things
> in one place, so we can find it quickly if a question comes up later and we want
> to remember what and why we did something.
> -Marshall
> On 2/8/2016 5:21 AM, Richard Eckart de Castilho wrote:
>> On 08.02.2016, at 11:11, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>> Am 08.02.2016 um 10:44 schrieb Richard Eckart de Castilho:
>>>> On 08.02.2016, at 10:11, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>>> Hi,
>>>>> Am 07.02.2016 um 19:52 schrieb Richard Eckart de Castilho:
>>>>>> Checks:
>>>>>> - compared POMs in 2.3.0 svn tag against 2.4.0 tag: no new dependencies - OK
>>>>>> - the FirstNames.txt file in GermanNovels is quite large (90k), but no source info/license for this file is given anywhere: doesn't seem OK
>>>>>> - stopping checks at this point for the moment
>>>>> What kind of source info/license would you expect? The file together
>>>>> with the other files was contributed as part of UIMA-3926 with an ICLA
>>>>> present. I do not remember if I knew the source of the file by then, but
>>>>> I remember that I had some conversations with the contributor that the
>>>>> files need to be OK for a contribution. That's the reason why the
>>>>> test/dev data was not contributed since it had some CC license that was
>>>>> problematic.
>>>> The other dev/test data doesn't seem problematic at all, but the 90k names
>>>> file seems non-trivial. If it were CC, the license would need to be mentioned
>>>> in a LICENSE.txt file. My suggestion would be to simply strip the file down
>>>> to the names needed for the example.
>>> If I have to guess I'd say that the names have been crawled and that
>>> there is no original source file with a specific license.
>>> The novels had the CC license last time I checked. I do not remember
>>> all, but when I looked it up in Apache's third party pages, it indicated
>>> that it was not possible to include them. However, I could have been wrong.
>>> Hmm... it depends on what is needed for the example. The initial example
>>> was 10-20 novels. I could strip it down to the first names of one novel
>>> I remember to be part of the dev set, but is that really necessary?
>> Let's see what Marshall thinks about it.
>> -- Richard
