tika-dev mailing list archives

From Tyler Bui-Palsulich <tpalsul...@apache.org>
Subject Re: Tika 1.15
Date Tue, 02 May 2017 23:19:58 GMT
Thanks for the link. It looks like the UI is written with Angular and uses
Elastic + static JSON. See
https://github.com/USCDataScience/polar-deep-insights/wiki/Architecture.

I also like d3. In general, I think we are on the same page: the best option
is a web-based UI.

I see a few options to get data into the frontend:
1. Static JSON
2. JSON from a server (meaning the server runs queries (either built by the
client or the server))
3. Load a local DB (meaning the client runs queries)

From some quick searching, option 3 seems to have poor support. I could be
wrong.

Options 1 and 2 are clearly related. If we have a working application with
static JSON, changing it to use served JSON (from a Java server, probably)
should be straightforward. Static JSON will be faster than live queries, but
I don't know how long the queries take. The polar project seems to hard-code
queries and provide an interface for entering more manually.
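
To make the relationship between options 1 and 2 concrete, here is a rough
sketch of dumping hard-coded report queries to static JSON. Everything in it
is an assumption for illustration: Python's sqlite3 stands in for whatever DB
the eval run actually uses, and the "exceptions" table, its columns, and the
report name are hypothetical, not the real tika-eval schema.

```python
import json
import sqlite3
from pathlib import Path


def run_report(conn, sql):
    """Run one report query and return its rows as plain dicts."""
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


# Hypothetical hard-coded report: exception counts per MIME type for runs A and B.
REPORTS = {
    "exceptions_by_mime": (
        "SELECT mime, "
        "SUM(run = 'A') AS exceptions_a, "
        "SUM(run = 'B') AS exceptions_b "
        "FROM exceptions GROUP BY mime ORDER BY mime"
    ),
}


def dump_static(conn, out_dir):
    """Option 1: write each report to <out_dir>/<name>.json for the frontend."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, sql in REPORTS.items():
        rows = run_report(conn, sql)
        (out / f"{name}.json").write_text(json.dumps(rows, indent=2))


def make_demo_db():
    """Build a tiny in-memory DB with made-up rows, only to exercise the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE exceptions (mime TEXT, run TEXT)")
    conn.executemany(
        "INSERT INTO exceptions VALUES (?, ?)",
        [
            ("application/pdf", "A"),
            ("application/pdf", "B"),
            ("application/pdf", "B"),
            ("image/emf", "B"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    dump_static(make_demo_db(), "static-json")
```

For option 2, a server could call the same run_report() per request and
return json.dumps() of the result, so the frontend would only need to fetch
from a different URL.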

Static JSON seems like the easiest way to get started. What do you think?

Tyler

On May 2, 2017 6:57 AM, "Chris Mattmann" <mattmann@apache.org> wrote:

> Team, check out Polar Insights, which my USC IRDS student Nithin did:
>
> http://polar.usc.edu/html/polar-deep-insights/index.html#/config
>
> Click Download, then Download (the 2 download buttons), then Save, then
> click the Query Interface. Something like this?
>
> All code is OSS on http://github.com/USCDataScience/polar-deep-insights/
>
> Cheers,
> Chris
>
>
> On 5/2/17, 4:54 AM, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>     Y.  It is daunting at this point, and please do help!
>
>     The key sheets I look at:
>
>     exceptions/exceptions_compared_by_mime_type.xlsx
>     exceptions/new_exceptions_in_B_by_mime.xlsx
>
>     mimes/mime_diffs_A_to_B.xlsx
>
>     attachments/attachment_diffs.xlsx
>
>     metadata/metadata_value_count_diffs.xlsx
>
>     I can dump json, but wouldn't it be easier for you to pull directly
> from the db?
>
>     My vision is to put a gui on the db that would allow you to visualize
> the reports/see the data and have links to the original (binary) files plus
> the extract files for both A and B (perhaps with a diff visualization).
>
>     Three cheers for d3.
>
>
>     -----Original Message-----
>     From: Tyler Bui-Palsulich [mailto:tpalsulich@apache.org]
>     Sent: Monday, May 1, 2017 11:39 PM
>     To: dev@tika.apache.org
>     Subject: RE: Tika 1.15
>
>     How exactly did you "evaluate" the results? I opened the zip and
> looked at a few of the sheets, but it's a bit daunting.
>
>     Any way we could dump JSON? That's a bit easier to build
> visualizations for.
>
>     Tyler
>
>     On May 1, 2017 3:59 PM, "Allison, Timothy B." <tallison@mitre.org>
> wrote:
>
>     > Sounds good.  W00t!
>     >
>     > -----Original Message-----
>     > From: Chris Mattmann [mailto:mattmann@apache.org]
>     > Sent: Monday, May 1, 2017 4:57 PM
>     > To: dev@tika.apache.org
>     > Subject: Re: Tika 1.15
>     >
>     > Thanks Tim. I am going to try and get tika-dl added (if possible),
> and
>     > also try the Sentiment Parser next. If I can get one or both of those
>     > (in the next day or so), then I will give you the heads up to begin
> testing.
>     > Video recognition is in!
>     >
>     >
>     >
>     >
>     >
>     > On 5/1/17, 12:42 PM, "Allison, Timothy B." <tallison@mitre.org>
> wrote:
>     >
>     >     I finally had a chance to look through the results of the first
>     > regression run.
>     >
>     >     I made a few trivial changes to our parsers and to tika-eval.
>     >
>     >     We appear to have many more exceptions in files parsed by our
>     > CompressorParser, but this is because of reporting...not because of
>     > reality
>     > -- the exception is now coming in the container file, not an
>     > attachment...and tika-eval wasn't matching A and B correctly.
>     >
>     >     There is a regression that's been fixed in PDFBox trunk
>     > (PDFBOX-3717), but I don't see that as a blocker.
>     >
>     >     We have new exceptions in the new parsers, EMF, WMF, .xlsb,
>     > wordperfect, but that's because we're actually parsing those now. :)
>     >
>     >     All else looks to be in decent shape.
>     >
>     >     Chris and Team and All,
>     >       Let me know when you're ready for me to kick off the next
>     > regression run.
>     >
>     >               Cheers,
>     >
>     >                       Tim
>     >
>     >
>     >
>     >
>     >     -----Original Message-----
>     >     From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.
> nasa.gov]
>     >     Sent: Wednesday, April 26, 2017 12:48 PM
>     >     To: dev@tika.apache.org
>     >     Subject: Re: Tika 1.15
>     >
>     >     Thank you!
>     >
>     >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     > ++++++++++++++
>     >     Chris Mattmann, Ph.D.
>     >     Principal Data Scientist, Engineering Administrative Office
> (3010)
>     > Manager, NSF & Open Source Projects Formulation and Development
>     > Offices
>     > (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     >     Office: 180-503E, Mailstop: 180-503
>     >     Email: chris.a.mattmann@nasa.gov
>     >     WWW:  http://sunset.usc.edu/~mattmann/
>     >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     > ++++++++++++++
>     >     Director, Information Retrieval and Data Science Group (IRDS)
>     > Adjunct Associate Professor, Computer Science Department University
> of
>     > Southern California, Los Angeles, CA 90089 USA
>     >     WWW: http://irds.usc.edu/
>     >     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     > ++++++++++++++
>     >
>     >
>     >     On 4/26/17, 9:35 AM, "Allison, Timothy B." <tallison@mitre.org>
> wrote:
>     >
>     >         Oh.  Ok.  Will wait, then?
>     >
>     >         -----Original Message-----
>     >         From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.
>     > nasa.gov]
>     >         Sent: Wednesday, April 26, 2017 11:38 AM
>     >         To: dev@tika.apache.org
>     >         Subject: Re: Tika 1.15
>     >
>     >         I want to see if I can get in the VideoRecognition parser,
> and
>     > also the Sentiment one.
>     >
>     >         I hope to get it done in the next day or so. Thanks.
>     >
>     >         ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     > ++++++++++++++
>     >         Chris Mattmann, Ph.D.
>     >         Principal Data Scientist, Engineering Administrative Office
>     > (3010) Manager, NSF & Open Source Projects Formulation and
> Development
>     > Offices
>     > (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     >         Office: 180-503E, Mailstop: 180-503
>     >         Email: chris.a.mattmann@nasa.gov
>     >         WWW:  http://sunset.usc.edu/~mattmann/
>     >         ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     > ++++++++++++++
>     >         Director, Information Retrieval and Data Science Group (IRDS)
>     > Adjunct Associate Professor, Computer Science Department University
> of
>     > Southern California, Los Angeles, CA 90089 USA
>     >         WWW: http://irds.usc.edu/
>     >         ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     > ++++++++++++++
>     >
>     >
>     >         On 4/26/17, 7:54 AM, "Allison, Timothy B."
>     > <tallison@mitre.org>
>     > wrote:
>     >
>     >             With the added TSD parser, I think I should rerun the
>     > regression testing.  Given that, I also fixed 2099, and we'll benefit
>     > from a rerun.
>     >
>     >             Anything else before I rerun the regression testing?
>     >
>     >             Any problems observed in first run?
