tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Tika 1.15
Date Wed, 17 May 2017 10:50:26 GMT
Full report on attachment # diffs: http://162.242.228.174/reports/attachment_diffs_complete_20170516.xlsx

I still need to look through the content diffs.

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Tuesday, May 16, 2017 3:11 PM
To: dev@tika.apache.org
Subject: RE: Tika 1.15

I reran the eval with some updates, including rc1 of PDFBox 2.0.6, which is now integrated.

http://162.242.228.174/reports/reports_tika_20170515.tar.gz

I need to do some more digging on attachments -- the report hit its max limit on diffs.  In the
few docs I reviewed, the decrease in attachments is explained by a change in the default behavior
of macro extraction -- in 1.14 we were extracting macros by default, but we aren't doing this in
1.15.  However, I want to look at more than the first x diffs because there may be other file
formats further down the results that weren't included in the report.

I also want to look at the contents...haven't had a chance.

>     On May 1, 2017 3:59 PM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>     > Sounds good.  W00t!
>     >
>     > -----Original Message-----
>     > From: Chris Mattmann [mailto:mattmann@apache.org]
>     > Sent: Monday, May 1, 2017 4:57 PM
>     > To: dev@tika.apache.org
>     > Subject: Re: Tika 1.15
>     >
>     > Thanks Tim. I am going to try and get tika-dl added (if possible), and
>     > also try the Sentiment Parser next. If I can get one or both of those
>     > (in the next day or so), then I will give you the heads up to begin testing.
>     > Video recognition is in!
>     >
>     > On 5/1/17, 12:42 PM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>     >
>     >     I finally had a chance to look through the results of the first
>     >     regression run.
>     >
>     >     I made a few trivial changes to our parsers and to tika-eval.
>     >
>     >     We appear to have many more exceptions in files parsed by our
>     >     CompressorParser, but this is because of reporting...not because of
>     >     reality -- the exception is now coming in the container file, not an
>     >     attachment...and tika-eval wasn't matching A and B correctly.
>     >
>     >     There is a regression that's been fixed in PDFBox trunk
>     >     (PDFBOX-3717), but I don't see that as a blocker.
>     >
>     >     We have new exceptions in the new parsers, EMF, WMF, .xlsb,
>     >     wordperfect, but that's because we're actually parsing those now. :)
>     >
>     >     All else looks to be in decent shape.
>     >
>     >     Chris and Team and All,
>     >       Let me know when you're ready for me to kick off the next
>     >       regression run.
>     >
>     >               Cheers,
>     >
>     >                       Tim
>     >
>     >
>     >
>     >
>     >     -----Original Message-----
>     >     From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
>     >     Sent: Wednesday, April 26, 2017 12:48 PM
>     >     To: dev@tika.apache.org
>     >     Subject: Re: Tika 1.15
>     >
>     >     Thank you!
>     >
>     >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >     Chris Mattmann, Ph.D.
>     >     Principal Data Scientist, Engineering Administrative Office (3010)
>     >     Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
>     >     NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     >     Office: 180-503E, Mailstop: 180-503
>     >     Email: chris.a.mattmann@nasa.gov
>     >     WWW:  http://sunset.usc.edu/~mattmann/
>     >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >     Director, Information Retrieval and Data Science Group (IRDS)
>     >     Adjunct Associate Professor, Computer Science Department
>     >     University of Southern California, Los Angeles, CA 90089 USA
>     >     WWW: http://irds.usc.edu/
>     >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >
>     >
>     >     On 4/26/17, 9:35 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>     >
>     >         Oh.  Ok.  Will wait, then?
>     >
>     >         -----Original Message-----
>     >         From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
>     >         Sent: Wednesday, April 26, 2017 11:38 AM
>     >         To: dev@tika.apache.org
>     >         Subject: Re: Tika 1.15
>     >
>     >         I want to see if I can get in the VideoRecognition parser, and
>     >         also the Sentiment one.
>     >
>     >         I hope to get it done in the next day or so. Thanks.
>     >
>     >
>     >
>     >         On 4/26/17, 7:54 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>     >
>     >             With the added TSD parser, I think I should rerun the
>     >             regression testing.  Given that, I also fixed 2099, and we'll benefit
>     >             from a rerun.
>     >
>     >             Anything else before I rerun the regression testing?
>     >
>     >             Any problems observed in first run?
>     >