tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Regression Testing
Date Tue, 08 Jul 2014 00:47:12 GMT

   My initial plan for TIKA-1302 is very similar to what Tilman outlined, and my understanding/concerns/thoughts
were very much in line with what he articulated.  The idea is that there should be a small
Apache license-able gold truth set like both projects now have for specific unit tests (patient-based
care), but that we should also occasionally take a public-health view and compare the outputs
of different versions of our parsers on a large set of docs to identify new exceptions or
large changes in extracted content/metadata. 
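
   As a rough illustration of the "public-health" comparison described above, here is a minimal sketch in Python. It is not code from TIKA-1302; the `ParseResult` record, the `compare_runs` function, and the 0.8 similarity threshold are all hypothetical names and values chosen for the example. It assumes each parser version has already been run over the same corpus and that its extracted text (or exception) was recorded per document. It compares extracted text only, not metadata.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class ParseResult:
    """Outcome of parsing one document with one parser version (hypothetical record)."""
    text: str = ""
    exception: Optional[str] = None  # exception name, or None if the parse succeeded


def compare_runs(old: Dict[str, ParseResult],
                 new: Dict[str, ParseResult],
                 min_similarity: float = 0.8) -> Dict[str, List[str]]:
    """Flag documents that newly throw, or whose extracted text changed a lot."""
    report: Dict[str, List[str]] = {"new_exceptions": [], "large_changes": []}
    # Only documents present in both runs can be compared.
    for doc_id in sorted(old.keys() & new.keys()):
        before, after = old[doc_id], new[doc_id]
        if after.exception and not before.exception:
            report["new_exceptions"].append(doc_id)
            continue
        if before.exception or after.exception:
            # Failed in the old run too (or was fixed): not a regression in this sketch.
            continue
        # Token-level similarity in [0, 1]; the threshold is an assumption.
        similarity = SequenceMatcher(None, before.text.split(), after.text.split()).ratio()
        if similarity < min_similarity:
            report["large_changes"].append(doc_id)
    return report


old_run = {
    "a.pdf": ParseResult(text="hello world foo bar"),
    "b.pdf": ParseResult(text="one two three four"),
    "c.pdf": ParseResult(text="same text here"),
}
new_run = {
    "a.pdf": ParseResult(exception="IOException"),
    "b.pdf": ParseResult(text="one nine eight seven"),
    "c.pdf": ParseResult(text="same text here"),
}
report = compare_runs(old_run, new_run)
```

   In this toy run, `a.pdf` would be reported as a new exception and `b.pdf` as a large content change, while `c.pdf` is unchanged. A real evaluation would likely need a cheaper or more robust similarity measure for a large corpus.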

   I'm persuaded by your points about fair use and the importance of "open data."  Before
proceeding on TIKA-1302, I'd like to get broader feedback on the way ahead via legal-discuss
or maybe jira's Legal.  Do you mind if I quote your arguments?

   Also, I was on my way to requesting a VM from infra for TIKA-1302.  Do you see any way
that we could share resources so that we're not double-storing files on Apache infrastructure?
 There may be easy ways to share some eval code as well.



-----Original Message-----
From: John Hewson [mailto:john@jahewson.com] 
Sent: Saturday, July 05, 2014 5:01 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing

On 5 Jul 2014, at 13:47, Tilman Hausherr <THausherr@t-online.de> wrote:

> On 05.07.2014 22:12, John Hewson wrote:
>>>>> Copyright is a problem: I'm testing mostly with JIRA attachments that
I've downloaded over the years. While uploading such files to JIRA might count as fair use,
I doubt that this would still be true if they are included in a distribution. Instead, they
should be stored somewhere on Apache servers where only committers and build software ("Travis",
"Jenkins", ...) can access them. The public PDFs that Maruan mentions possibly don't have
all the problem cases that we solved before. However, I have started working with these files
and there are at least 5 recent issues that deal with them.
>>>> The PDFs won't be in a distribution. They will just happen to be stored in
an SVN repo but not our source code repo, in the same way that the website is stored in the
"cmssite" branch of SVN or, indeed, that files are stored on JIRA. The law doesn't distinguish between JIRA and
SVN; both are publicly available via HTTP, so using SVN will simply be a continuation of what
we're already doing with JIRA.
>>>> The crucial factor is that we're only storing publicly available PDFs, because
we have the right to do so, just like Google's cache, and like we currently do with JIRA.
>>> Yes but many PDFs we got aren't really "public". If this svn repo is only accessible
to committers, and if the publicly available build scripts won't break because of this, then
it is OK.
>> Any non-public PDFs will not be permitted in our test suite, just as they shouldn't
be on JIRA.
>>> Note that even if something is "publicly available", it may still be copyrighted.
Other risks can be that some people upload PDFs that include personal data. One really good
test PDF was apparently a loan application. I remember that the user insisted that 1. it was
test data, and 2. that it be removed.
>> All Apache development should be in the open; this is a key ASF principle, and having
a committers-only test suite is basically a no-no. It's important to understand that "fair
use" allows us to use copyrighted works - this is expressly permitted, and it's the same legal
principle as Google's cache. There is no need to seek permission. This is what we've been
doing with JIRA for years already, so it's fine.
> The problem is that this has all happened before. A few years ago, many files were deleted,
see PDFBOX-391.

That issue is about including files in the source code repo as part of the PDFBox distribution,
where there is a need to put files under an Apache 2.0 compatible license. What I'm advocating
is keeping a separate public repository of test files which are not a part of the PDFBox source,
like we currently have on JIRA.

-- John
