tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
Date Fri, 10 Oct 2014 12:50:33 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166755#comment-14166755 ]

Tim Allison commented on TIKA-1442:

This is in response to our discussion on TIKA-1419.

Yes, I agree that we should flag unparseables in the test set so that we don't have to manually
open them again and again to confirm that there's junk there, just different junk. If you
send me a junk list, I'll add a junk column to my local db for those files and include that
in future dumps. Once we make this testing public, it would be great to create a UI to allow
people to flag extracted text as "great, let's use this extracted text as a gold standard
for text/metadata extraction" or to flag source docs as unparseable.

In my dev-dev version of the extractor comparison code, I include the top 10 most frequent
words in the doc and a count of how many of those are English stop words.  As you suggest,
that's a reasonable indicator (if the docs are English) that something might have gone wrong.
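
For illustration only, here is a rough sketch of that check. It is not the actual comparison
code: the class name, the method names, and the tiny stop word list are all made up for this
example, and a real check would want a fuller list and better tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika or of the extractor comparison code.
final class StopWordCheck {

    // Deliberately tiny, illustrative stop word list (an assumption for this sketch).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "of", "and", "a", "to", "in", "is", "that",
            "it", "for", "on", "with", "as", "was", "be"));

    // Returns up to n most frequent lowercased words; ties are broken alphabetically.
    static List<String> topWords(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter; skip empty tokens.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    // Counts how many of the given words appear in the stop word list.
    static int countStopWords(List<String> words) {
        int hits = 0;
        for (String word : words) {
            if (STOP_WORDS.contains(word)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String extracted = "The cat sat on the mat and the dog sat in the hall.";
        List<String> top = topWords(extracted, 10);
        System.out.println(top + " stop words in top 10: " + countStopWords(top));
    }
}
```

A low stop word count among the top 10 for a doc that should be English would be the cue
to open the file and look; what threshold to use is a judgment call.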

Another thing that would make manual review a whole lot easier would be a UI with a word-level

What other statistics could we use to help guide the manual review?

> Upgrade to PDFBox 1.8.8
> -----------------------
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon
> as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to
> carry on the discussion of regression testing (if any further discussion is necessary) or
> any other prep that needs to happen before 1.8.8's release.
