tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Tika 1.15
Date Tue, 02 May 2017 12:04:49 GMT
The other two critical files:

Content/common_token_comparisons_by_mime.xlsx
Content/content_diffs_ignore_exceptions.xlsx


Oh, and the key part, which is less than ideal, is that there has to be a human in the loop...which
makes the need for visualizations even more critical.

For example:

1) We now have more exceptions in file type y.  Well, that's ok because we didn't have a parser
for file type y before.  

2) We have fewer exceptions in file type x; that should be good, right?  Well, no, because
now there are far fewer "common words" in x, which means that the parser became less restrictive
and sloppier.  We now have more noise.

3) We now have more "common words" in file type x; that should be a sign of improvement, right?
 Not necessarily, because:
	a) we failed to remove a few common HTML markup terms and our HTML parser/detection is failing,
so we have a bunch more "span" and "body" words.  That's bad.  (We can fix this as we go forward.)
	b) our parsers are repeating sections now.  Doh! (We can fix this with better statistics.)
	c) our OCR is hallucinating common words because we're using a heavily dictionary-biased
OCR system.  (Unlikely, but possible.)

The lists go on...
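
To make the three cases above concrete, here is a minimal sketch of a triage pass that only flags file types for a human to look at; it does not decide anything by itself. Everything in it is an assumption for illustration: the MimeStats fields, the triage function, and the 10% threshold are hypothetical and are not the tika-eval schema or API. It also assumes a file type with no parser in the old run simply has no entry there.

```python
from dataclasses import dataclass


@dataclass
class MimeStats:
    """Hypothetical per-file-type counts from one regression run."""
    exceptions: int      # files that threw during parsing
    common_tokens: int   # total "common word" hits in the extracted text


def rel_change(old: int, new: int) -> float:
    """Relative change from old to new; 0.0 when both are zero."""
    if old == 0:
        return 0.0 if new == 0 else float("inf")
    return (new - old) / old


def triage(old: dict, new: dict, threshold: float = 0.10) -> dict:
    """Map each mime type in `new` to a list of notes for a human reviewer."""
    notes = {}
    for mime, n in new.items():
        o = old.get(mime)
        if o is None:
            # Case 1: no stats in the old run (assumed: no parser existed yet).
            notes[mime] = ["new coverage: exceptions are expected, review separately"]
            continue
        flags = []
        exc = rel_change(o.exceptions, n.exceptions)
        tok = rel_change(o.common_tokens, n.common_tokens)
        if exc > threshold:
            flags.append("more exceptions: possible regression")
        if exc < 0 and tok < -threshold:
            # Case 2: fewer exceptions, but the text got noisier.
            flags.append("fewer exceptions but fewer common words: possible noise")
        if tok > threshold:
            # Case 3: more common words is not automatically good news.
            flags.append("more common words: check markup leakage, repeated sections, OCR")
        if flags:
            notes[mime] = flags
    return notes


if __name__ == "__main__":
    old_run = {"application/pdf": MimeStats(10, 1000), "text/html": MimeStats(5, 500)}
    new_run = {
        "application/pdf": MimeStats(4, 600),
        "text/html": MimeStats(5, 800),
        "image/emf": MimeStats(3, 0),
    }
    for mime, flags in triage(old_run, new_run).items():
        print(mime, "->", "; ".join(flags))
```

The output is a to-do list for a reviewer rather than a pass/fail verdict, which matches the point that the same number can mean improvement or regression depending on context.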

In short, my original vision of nightly automated tests has had a run-in with reality and
lost.  A human has to make sense of the output/db.

Dumping some reports to xlsx yields good data for the developer who wrote the code, but,
I agree, the reports are largely incomprehensible to someone getting started.

So, please, help!



-----Original Message-----
From: Tyler Bui-Palsulich [mailto:tpalsulich@apache.org] 
Sent: Monday, May 1, 2017 11:39 PM
To: dev@tika.apache.org
Subject: RE: Tika 1.15

How exactly did you "evaluate" the results? I opened the zip and looked at a few of the sheets,
but it's a bit daunting.

Any way we could dump JSON? That's a bit easier to build visualizations for.
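
One way to picture such a JSON dump, as a minimal sketch: it assumes each report sheet has already been read into a header row plus data rows (reading the xlsx itself is left out), and the column names below are made up for illustration, not the actual report columns.

```python
import json


def rows_to_json(header, rows):
    """Turn a header row plus data rows into a JSON array of objects."""
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    # Hypothetical column names and values, for illustration only.
    header = ["mime", "exceptions_a", "exceptions_b"]
    rows = [["application/pdf", 10, 4], ["text/html", 5, 5]]
    print(rows_to_json(header, rows))
```

One object per row keeps the column names attached to every value, which tends to be the easiest shape for charting code to consume.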

Tyler

On May 1, 2017 3:59 PM, "Allison, Timothy B." <tallison@mitre.org> wrote:

> Sounds good.  W00t!
>
> -----Original Message-----
> From: Chris Mattmann [mailto:mattmann@apache.org]
> Sent: Monday, May 1, 2017 4:57 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.15
>
> Thanks Tim. I am going to try and get tika-dl added (if possible), and 
> also try the Sentiment Parser next. If I can get one or both of those 
> (in the next day or so), then I will give you the heads up to begin testing.
> Video recognition is in!
>
>
>
>
>
> On 5/1/17, 12:42 PM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>     I finally had a chance to look through the results of the first 
> regression run.
>
>     I made a few trivial changes to our parsers and to tika-eval.
>
>     We appear to have many more exceptions in files parsed by our
> CompressorParser, but this is because of reporting...not because of
> reality -- the exception is now coming in the container file, not an
> attachment...and tika-eval wasn't matching A and B correctly.
>
>     There is a regression that's been fixed in PDFBox trunk 
> (PDFBOX-3717), but I don't see that as a blocker.
>
>     We have new exceptions in the new parsers, EMF, WMF, .xlsb, 
> wordperfect, but that's because we're actually parsing those now. :)
>
>     All else looks to be in decent shape.
>
>     Chris and Team and All,
>       Let me know when you're ready for me to kick off the next 
> regression run.
>
>               Cheers,
>
>                       Tim
>
>
>
>
>     -----Original Message-----
>     From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
>     Sent: Wednesday, April 26, 2017 12:48 PM
>     To: dev@tika.apache.org
>     Subject: Re: Tika 1.15
>
>     Thank you!
>
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>     Chris Mattmann, Ph.D.
>     Principal Data Scientist, Engineering Administrative Office (3010) 
> Manager, NSF & Open Source Projects Formulation and Development 
> Offices
> (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     Office: 180-503E, Mailstop: 180-503
>     Email: chris.a.mattmann@nasa.gov
>     WWW:  http://sunset.usc.edu/~mattmann/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>     Director, Information Retrieval and Data Science Group (IRDS) 
> Adjunct Associate Professor, Computer Science Department University of 
> Southern California, Los Angeles, CA 90089 USA
>     WWW: http://irds.usc.edu/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>
>
>     On 4/26/17, 9:35 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>         Oh.  Ok.  Will wait, then?
>
>         -----Original Message-----
>         From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
>         Sent: Wednesday, April 26, 2017 11:38 AM
>         To: dev@tika.apache.org
>         Subject: Re: Tika 1.15
>
>         I want to see if I can get in the VideoRecognition parser, and 
> also the Sentiment one.
>
>         I hope to get it done in the next day or so. Thanks.
>
>
>
>         On 4/26/17, 7:54 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>             With the added TSD parser, I think I should rerun the 
> regression testing.  Given that, I also fixed 2099, and we'll benefit 
> from a rerun.
>
>             Anything else before I rerun the regression testing?
>
>             Any problems observed in first run?
>