tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: [VOTE] Apache Tika 1.11 Release Candidate #1
Date Wed, 21 Oct 2015 13:42:40 GMT
+0 (some regressions in ppt content)

I just finished the batch comparison run on ~1.8 million files in our govdocs1 and commoncrawl
corpora, comparing Tika 1.10 to 1.11-rc1.  As a caveat, the eval code is still in development
and there may be bugs in the reports.

Results are here: https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_10_vs_1_11-rc1.zip


Key reports:
contents/content_diffs.csv (file had one corrupt row when viewing in Excel...manually deleted
offending content)
exceptions/newExceptionsInBByMimeTypeByStackTrace.csv (small handful)
exceptions/fixedExceptionsInBByMimeType.csv  (none!)
mimes/mime_diffs_A_to_B.csv

On the positive side:
From "mime_diffs_A_to_B.csv", it looks like we are catching more pdfs as pdfs (rather than as text/xhtml)
than we were...great!  We're identifying more files as images (jpeg, pict) than as xhtml,
and, from a quick look, this appears to be an improvement.  We have at least 9 new x-hwp-v5
(great!).

On the negative side:

1) We have a few regressions in ppt exceptions (six instances of the same ArrayIndexOutOfBoundsException).
2) We have regressions in ppt content (it looks like we're not adding a new line/word break
where we need to).  The regressions are small per file, but they affect ~220 ppts out of ~1500
(~15%). 

Other than the regressions in ppt content, I'd be +1, but I don't think this is severe enough
to warrant a re-spin.  Happy to look into a fix, though, if we want a re-spin...and even if
we don't, I'll start looking into this asap.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Monday, October 19, 2015 10:23 AM
To: dev@tika.apache.org
Cc: user@tika.apache.org
Subject: [VOTE] Apache Tika 1.11 Release Candidate #1

Hi Folks,

A first candidate for the Tika 1.11 release is available at:

  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/

The SHA1 checksum of the archive is
d0dde7b3a4f1a2fb6ccd741552ea180dddab630a
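
For anyone checking the archive before voting, the comparison against the published SHA1 can be sketched roughly as below. This is a minimal sketch, not part of the release tooling: the helper names are hypothetical, and the local file name in the usage comment is an assumption, since the message does not name the archive file.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA-1 with a published value, ignoring case and padding."""
    return sha1_of_file(path) == expected_hex.strip().lower()


# Usage (the file name is an assumption; substitute the archive you downloaded):
# matches_published("tika-1.11-src.zip",
#                   "d0dde7b3a4f1a2fb6ccd741552ea180dddab630a")
```

A mismatch only says the local file differs from what was hashed; it does not replace checking the release signatures.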

In addition, a staged maven repository is available here:

https://repository.apache.org/content/repositories/orgapachetika-1014/


Please vote on releasing this package as Apache Tika 1.11.
The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika
PMC votes are cast.
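
One reading of the rule above, sketched in code: the vote passes when there are at least three binding +1 votes and more +1 than -1 votes. That interpretation, the function name, and the encoding of votes as integers are all assumptions made for illustration; a +0 is treated as counting neither way.

```python
def vote_passes(binding_votes):
    """Return True if the tally meets the assumed pass rule.

    binding_votes: iterable of ints, one per PMC vote (+1, 0 or -1).
    Assumed rule: at least three +1 votes and more +1 than -1 votes.
    """
    votes = list(binding_votes)
    plus = sum(1 for v in votes if v == 1)
    minus = sum(1 for v in votes if v == -1)
    return plus >= 3 and plus > minus


# Example: three +1 votes and one +0 pass under this reading.
print(vote_passes([1, 1, 1, 0]))
```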

[ ] +1 Release this package as Apache Tika 1.11
[ ] -1 Do not release this package because…

Cheers,
Chris

P.S. Of course here is my +1.



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


