tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1332) Create "eval" code
Date Thu, 07 Apr 2016 14:09:25 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230287#comment-15230287 ]

Tim Allison commented on TIKA-1332:

I gave up on that, and we're now using httpd.

The eval code currently exists as command-line calls.  I'm using H2 as the backend database,
which appears to be compatible with ASL 2.0.  As with all development cycles, I started with
a flat file, moved to an unfortunately complex db structure and will probably have to move
to NoSQL if we want this to scale...but not yet...

As above, there are two modes.
1) Profile a single run
   a) run tika-app on a directory of files, output with -J -t (JSON representation of List<Metadata>
with text as the content)
   b) run the profiling code, which populates an h2 db
   c) run xml-configured reports against the db
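
To make step (b) concrete, here is a rough Python sketch of what profiling one run's output might look like. It is not the eval code itself: it keeps the counts in memory rather than in an H2 db, the names profile_run and first_value are made up for illustration, and the metadata keys "X-TIKA:content" and "Content-Type" are assumptions about what the -J -t output contains.

```python
import json
from collections import Counter, defaultdict


def first_value(value):
    """Metadata values in the JSON may be a string or a list of strings."""
    if isinstance(value, list):
        return value[0] if value else ""
    return value or ""


def profile_run(json_docs, content_key="X-TIKA:content", type_key="Content-Type"):
    """Aggregate simple per-MIME-type statistics for one run.

    json_docs: iterable of JSON strings, each a list of metadata objects
    (assumed: container document first, then any embedded documents).
    The key names are assumptions, passed as parameters so they can be changed.
    """
    stats = defaultdict(Counter)
    for raw in json_docs:
        for metadata in json.loads(raw):
            mime = first_value(metadata.get(type_key)) or "unknown"
            # Drop parameters such as "; charset=UTF-8".
            mime = mime.split(";")[0].strip()
            text = first_value(metadata.get(content_key))
            row = stats[mime]
            row["documents"] += 1
            # Count every key except the extracted text itself.
            row["metadata_items"] += sum(1 for k in metadata if k != content_key)
            row["text_chars"] += len(text)
    return {mime: dict(counts) for mime, counts in stats.items()}
```

A report would then be some view over these per-type rows, e.g. sorted by documents or by text_chars.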

2) Compare two runs
  a) run two versions of tika-app on a directory of files
  b) run the comparison code, which populates an h2 db
  c) run xml-configured reports against the db
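
For the comparison mode, one plausible per-file measure is how much the extracted text of the two runs overlaps. The Python sketch below is only an illustration under that assumption: the whitespace tokenization, the Dice-style overlap score, and the names compare_texts and compare_runs are all hypothetical, not what the comparison code necessarily computes.

```python
from collections import Counter


def token_counts(text):
    """Lower-cased whitespace tokens; a stand-in for real tokenization."""
    return Counter(text.lower().split())


def compare_texts(text_a, text_b):
    """Token counts for each side plus a Dice-style overlap in [0, 1]."""
    a, b = token_counts(text_a), token_counts(text_b)
    shared = sum((a & b).values())  # multiset intersection
    total = sum(a.values()) + sum(b.values())
    return {
        "tokens_a": sum(a.values()),
        "tokens_b": sum(b.values()),
        # Two empty extractions are treated as identical.
        "overlap": 2 * shared / total if total else 1.0,
    }


def compare_runs(run_a, run_b):
    """run_a, run_b: dicts mapping a file name to its extracted text.

    A file missing from one run is compared against empty text.
    """
    report = {}
    for name in sorted(set(run_a) | set(run_b)):
        report[name] = compare_texts(run_a.get(name, ""), run_b.get(name, ""))
    return report
```

Files with a low overlap score, or a large drop in token count, are the ones a human would want to look at first in the reports.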

I've pretty much given up on the notion of automatic testing.  A human has to look at the
reports and make sense of them.

Given the feedback I received at ApacheCon (egads, a year ago), I think I'd like to transition
this code into Tika for 1.14.

When the code is ready for review, I'll let y'all know.  Any and all feedback on the reports
to date would be great.

> Create "eval" code
> ------------------
>                 Key: TIKA-1332
>                 URL: https://issues.apache.org/jira/browse/TIKA-1332
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
> For this issue, we can start with code to gather statistics on each run (# of exceptions
> per file type, most common exceptions per file type, number of metadata items, total text
> extracted, etc).  We should also be able to compare one run against another.  Going forward,
> there's plenty of room to improve.

This message was sent by Atlassian JIRA