tika-dev mailing list archives

From Chris Mattmann <mattm...@apache.org>
Subject Re: Tika 1.15
Date Tue, 02 May 2017 13:57:09 GMT
Team, check out Polar Insights, which my USC IRDS student Nithin did:

http://polar.usc.edu/html/polar-deep-insights/index.html#/config

Click Download, then Download (the 2 download buttons), then Save, then
click the Query Interface. Something like this?

All code is OSS on http://github.com/USCDataScience/polar-deep-insights/ 

Cheers,
Chris


On 5/2/17, 4:54 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:

    Y.  It is daunting at this point, and please do help!
    
    The key sheets I look at:
    
    exceptions/exceptions_compared_by_mime_type.xlsx
    exceptions/new_exceptions_in_B_by_mime.xlsx
    
    mimes/mime_diffs_A_to_B.xlsx
    
    attachments/attachment_diffs.xlsx
    
    metadata/metadata_value_count_diffs.xlsx
    
    I can dump json, but wouldn't it be easier for you to pull directly from the db?
    
    My vision is to put a gui on the db that would allow you to visualize the reports/see
    the data and have links to the original (binary) files plus the extract files for both A and
    B (perhaps with a diff visualization).
    
    Three cheers for d3.
    
    
    -----Original Message-----
    From: Tyler Bui-Palsulich [mailto:tpalsulich@apache.org] 
    Sent: Monday, May 1, 2017 11:39 PM
    To: dev@tika.apache.org
    Subject: RE: Tika 1.15
    
    How exactly did you "evaluate" the results? I opened the zip and looked at a few of the
    sheets, but it's a bit daunting.
    
    Any way we could dump JSON? That's a bit easier to build visualizations for.
    
    Tyler
    
    On May 1, 2017 3:59 PM, "Allison, Timothy B." <tallison@mitre.org> wrote:
    
    > Sounds good.  W00t!
    >
    > -----Original Message-----
    > From: Chris Mattmann [mailto:mattmann@apache.org]
    > Sent: Monday, May 1, 2017 4:57 PM
    > To: dev@tika.apache.org
    > Subject: Re: Tika 1.15
    >
    > Thanks Tim. I am going to try and get tika-dl added (if possible), and 
    > also try the Sentiment Parser next. If I can get one or both of those 
    > (in the next day or so), then I will give you the heads up to begin testing.
    > Video recognition is in!
    >
    >
    >
    >
    >
    > On 5/1/17, 12:42 PM, "Allison, Timothy B." <tallison@mitre.org> wrote:
    >
    >     I finally had a chance to look through the results of the first 
    > regression run.
    >
    >     I made a few trivial changes to our parsers and to tika-eval.
    >
    >     We appear to have many more exceptions in files parsed by our 
    > CompressorParser, but this is because of reporting...not because of 
    > reality -- the exception is now coming in the container file, not an 
    > attachment...and tika-eval wasn't matching A and B correctly.
    >
    >     There is a regression that's been fixed in PDFBox trunk 
    > (PDFBOX-3717), but I don't see that as a blocker.
    >
    >     We have new exceptions in the new parsers, EMF, WMF, .xlsb, 
    > wordperfect, but that's because we're actually parsing those now. :)
    >
    >     All else looks to be in decent shape.
    >
    >     Chris and Team and All,
    >       Let me know when you're ready for me to kick off the next 
    > regression run.
    >
    >               Cheers,
    >
    >                       Tim
    >
    >
    >
    >
    >     -----Original Message-----
    >     From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
    >     Sent: Wednesday, April 26, 2017 12:48 PM
    >     To: dev@tika.apache.org
    >     Subject: Re: Tika 1.15
    >
    >     Thank you!
    >
    >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    >     Chris Mattmann, Ph.D.
    >     Principal Data Scientist, Engineering Administrative Office (3010)
    >     Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
    >     NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
    >     Office: 180-503E, Mailstop: 180-503
    >     Email: chris.a.mattmann@nasa.gov
    >     WWW:  http://sunset.usc.edu/~mattmann/
    >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    >     Director, Information Retrieval and Data Science Group (IRDS)
    >     Adjunct Associate Professor, Computer Science Department
    >     University of Southern California, Los Angeles, CA 90089 USA
    >     WWW: http://irds.usc.edu/
    >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    >
    >
    >     On 4/26/17, 9:35 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
    >
    >         Oh.  Ok.  Will wait, then?
    >
    >         -----Original Message-----
    >         From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov]
    >         Sent: Wednesday, April 26, 2017 11:38 AM
    >         To: dev@tika.apache.org
    >         Subject: Re: Tika 1.15
    >
    >         I want to see if I can get in the VideoRecognition parser, and 
    > also the Sentiment one.
    >
    >         I hope to get it done in the next day or so. Thanks.
    >
    >
    >
    >         On 4/26/17, 7:54 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
    >
    >             With the added TSD parser, I think I should rerun the 
    > regression testing.  Given that, I also fixed 2099, and we'll benefit 
    > from a rerun.
    >
    >             Anything else before I rerun the regression testing?
    >
    >             Any problems observed in first run?
    >
    >
    >
    >
    >
    >
    >
    >
    >
    


