tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Tika 1.15
Date Tue, 02 May 2017 11:54:25 GMT
Yes.  It is daunting at this point, and please do help!

The key sheets I look at:

exceptions/exceptions_compared_by_mime_type.xlsx
exceptions/new_exceptions_in_B_by_mime.xlsx

mimes/mime_diffs_A_to_B.xlsx

attachments/attachment_diffs.xlsx

metadata/metadata_value_count_diffs.xlsx

I can dump json, but wouldn't it be easier for you to pull directly from the db?

My vision is to put a gui on the db that would allow you to visualize the reports/see the
data and have links to the original (binary) files plus the extract files for both A and B
(perhaps with a diff visualization).

Three cheers for d3.


-----Original Message-----
From: Tyler Bui-Palsulich [mailto:tpalsulich@apache.org] 
Sent: Monday, May 1, 2017 11:39 PM
To: dev@tika.apache.org
Subject: RE: Tika 1.15

How exactly did you "evaluate" the results? I opened the zip and looked at a few of the sheets,
but it's a bit daunting.

Any way we could dump JSON? That's a bit easier to build visualizations for.

Tyler

On May 1, 2017 3:59 PM, "Allison, Timothy B." <tallison@mitre.org> wrote:

> Sounds good.  W00t!
>
> -----Original Message-----
> From: Chris Mattmann [mailto:mattmann@apache.org]
> Sent: Monday, May 1, 2017 4:57 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.15
>
> Thanks Tim. I am going to try and get tika-dl added (if possible), and 
> also try the Sentiment Parser next. If I can get one or both of those 
> (in the next day or so), then I will give you the heads up to begin testing.
> Video recognition is in!
>
>
>
>
>
> On 5/1/17, 12:42 PM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>     I finally had a chance to look through the results of the first
>     regression run.
>
>     I made a few trivial changes to our parsers and to tika-eval.
>
>     We appear to have many more exceptions in files parsed by our
>     CompressorParser, but this is because of reporting...not because of
>     reality -- the exception is now coming in the container file, not an
>     attachment...and tika-eval wasn't matching A and B correctly.
>
>     There is a regression that's been fixed in PDFBox trunk
>     (PDFBOX-3717), but I don't see that as a blocker.
>
>     We have new exceptions in the new parsers, EMF, WMF, .xlsb,
>     wordperfect, but that's because we're actually parsing those now. :)
>
>     All else looks to be in decent shape.
>
>     Chris and Team and All,
>       Let me know when you're ready for me to kick off the next
>       regression run.
>
>               Cheers,
>
>                       Tim
>
>
>
>
>     -----Original Message-----
>     From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
>     Sent: Wednesday, April 26, 2017 12:48 PM
>     To: dev@tika.apache.org
>     Subject: Re: Tika 1.15
>
>     Thank you!
>
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     Chris Mattmann, Ph.D.
>     Principal Data Scientist, Engineering Administrative Office (3010)
>     Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
>     NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     Office: 180-503E, Mailstop: 180-503
>     Email: chris.a.mattmann@nasa.gov
>     WWW:  http://sunset.usc.edu/~mattmann/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     Director, Information Retrieval and Data Science Group (IRDS)
>     Adjunct Associate Professor, Computer Science Department
>     University of Southern California, Los Angeles, CA 90089 USA
>     WWW: http://irds.usc.edu/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>     On 4/26/17, 9:35 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>         Oh.  Ok.  Will wait, then?
>
>         -----Original Message-----
>         From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
>         Sent: Wednesday, April 26, 2017 11:38 AM
>         To: dev@tika.apache.org
>         Subject: Re: Tika 1.15
>
>         I want to see if I can get in the VideoRecognition parser, and
>         also the Sentiment one.
>
>         I hope to get it done in the next day or so. Thanks.
>
>
>         On 4/26/17, 7:54 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>             With the added TSD parser, I think I should rerun the
>             regression testing.  Given that, I also fixed 2099, and we'll
>             benefit from a rerun.
>
>             Anything else before I rerun the regression testing?
>
>             Any problems observed in first run?
>