tika-dev mailing list archives

From Bob Paulin <...@bobpaulin.com>
Subject Re: [VOTE] Apache Tika 1.11 Release Candidate #1
Date Fri, 23 Oct 2015 23:27:22 GMT
+1 - All projects build, tests pass, OSGi bundle loads, checksums match.

- Bob

On 10/22/2015 8:49 PM, Tyler Palsulich wrote:
> +1 from me -- builds, tests pass, sanity check files parse, and sums look
> good. But, I get a warning that the signature is not certified with a
> trusted signature.
>
> Tyler
>
> On Wed, Oct 21, 2015 at 6:43 AM Allison, Timothy B. <tallison@mitre.org>
> wrote:
>
>> +0 (some regressions in ppt content)
>>
>> I just finished the batch comparison run on ~1.8 million files in our
>> govdocs1 and commoncrawl corpora comparing Tika 1.10 to 1.11-rc1.  As a
>> caveat, the eval code is still in development and there may be bugs in the
>> reports.
>>
>> Results are here:
>> https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_10_vs_1_11-rc1.zip
>>
>> Key reports:
>> contents/content_diffs.csv (file had one corrupt row when viewing in
>> Excel...manually deleted offending content)
>> exceptions/newExceptionsInBByMimeTypeByStackTrace.csv (small handful)
>> exceptions/fixedExceptionsInBByMimeType.csv  (none!)
>> mimes/mime_diffs_A_to_B.csv
>>
>> On the positive side:
>> From "mime_diffs_A_to_B.csv", it looks like we are catching more pdfs as
>> pdfs (rather than text/xhtml) than we were...great!  We're identifying more files
>> as images (jpeg, pict) than as xhtml, and, from a quick look, this appears
>> to be an improvement.  We have at least 9 new x-hwp-v5 (great!).
>>
>> On the negative side:
>>
>> 1) We have a few regressions in ppt exceptions (six of the same
>> ArrayIndexOutOfBoundsException).
>> 2) We have regressions in ppt content (it looks like we're not adding a
>> new line/word break where we need to).  The regressions are small per file,
>> but they affect ~220 ppts out of ~1500 (~15%).
>>
>> Other than the regressions in ppt content, I'd be +1, but I don't think
>> this is severe enough to warrant a re-spin.  Happy to look into a fix,
>> though, if we want a re-spin...and even if we don't, I'll start looking
>> into this asap.
>>
>> -----Original Message-----
>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>> Sent: Monday, October 19, 2015 10:23 AM
>> To: dev@tika.apache.org
>> Cc: user@tika.apache.org
>> Subject: [VOTE] Apache Tika 1.11 Release Candidate #1
>>
>> Hi Folks,
>>
>> A first candidate for the Tika 1.11 release is available at:
>>
>>    https://dist.apache.org/repos/dist/dev/tika/
>>
>> The release candidate is a zip archive of the sources in:
>>    http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/
>>
>> The SHA1 checksum of the archive is
>> d0dde7b3a4f1a2fb6ccd741552ea180dddab630a
>>
>> In addition, a staged maven repository is available here:
>>
>> https://repository.apache.org/content/repositories/orgapachetika-1014/
>>
>>
>> Please vote on releasing this package as Apache Tika 1.11.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Tika 1.11
>> [ ] -1 Do not release this package because...
>>
>> Cheers,
>> Chris
>>
>> P.S. Of course here is my +1.
>>
>>
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
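
Several votes in this thread mention that the checksums match. As a rough sketch of what that check involves, the snippet below computes the SHA-1 of a downloaded archive and compares it with the value quoted in the vote announcement. The function names and the archive filename in the usage note are hypothetical; nothing here reflects the tooling the voters actually used.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-1 digest of the file at `path`.

    The file is read in chunks so large archives are not loaded
    into memory all at once.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare the file's SHA-1 with a published hex digest.

    Surrounding whitespace and letter case in `expected_hex` are
    ignored, since published sums are often pasted from an email.
    """
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For the candidate above, a call would look like checksum_matches("tika-1.11-src.zip", "d0dde7b3a4f1a2fb6ccd741552ea180dddab630a"), where the filename is an assumption about what was downloaded from the dist area. A matching checksum only shows the download is intact; the signature warning Tyler mentions is a separate question of key trust.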

