tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
Date Wed, 22 Oct 2014 11:20:36 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179809#comment-14179809 ]

Tim Allison edited comment on TIKA-1442 at 10/22/14 11:20 AM:
--------------------------------------------------------------

[~tilman], thank you, again, for all of your work on this.

Tika community, if you have a chance, take a look at the attached comparison file and recommend
other statistics that would be useful for file comparison (TIKA-1332) and junk detection (TIKA-1443).

I added the following columns:
language id: language and confidence score
top10words
count of the top 10 words that are stopwords in English (based on Lucene's StandardAnalyzer's
list)...I need to make this language specific...if the langid component says "so", we need
to count the number of so stopwords.
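
As a rough illustration of the "top10words" and stopword-count columns described above, here is a minimal Python sketch. The function names, the tokenization rule, and the tiny stopword table are all assumptions made for this example; in particular, the English list below is a short stand-in and not a copy of the list used by Lucene's StandardAnalyzer.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stopword table keyed by language code.
# Illustration only: NOT the list shipped with Lucene's StandardAnalyzer.
STOPWORDS = {
    "en": frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}),
}


def top_words(text, n=10):
    """Return the n most frequent lowercased word tokens in text."""
    tokens = re.findall(r"\w+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(n)]


def stopword_count(words, lang):
    """Count how many of the given words are stopwords for lang.

    Returns None when this sketch has no stopword list for lang,
    so that "no list available" is not confused with "zero stopwords".
    """
    stop = STOPWORDS.get(lang)
    if stop is None:
        return None
    return sum(1 for word in words if word in stop)
```

Keying the table by the language code returned from language identification is one way to make the count language specific; a code with no entry (for example "so" in this sketch) yields None rather than a misleading 0.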

I renamed some of the column headers.  I finally had a chance to break out Manning and Schutze...
"token overlap" is actually Dice coefficient.
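
For reference, the Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). A minimal Python sketch follows, assuming the comparison is made over the sets of distinct tokens from each extraction; whether the spreadsheet uses distinct tokens or token counts is not stated here, and the function name is hypothetical.

```python
def dice_coefficient(tokens_a, tokens_b):
    """Dice coefficient over the sets of distinct tokens in two sequences.

    Assumption for this sketch: duplicates are collapsed, so token
    frequencies do not affect the score.
    """
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty inputs are identical.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))
```

With this set-based form, two extractions that contain the same distinct tokens score 1.0 even if the tokens occur a different number of times in each.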

I added a vlookup column for [~tilman]'s notes. 

I cannot figure out why I'm getting different lang id confidence scores for a given file pair
if the Dice Coefficient is 1.0.  I need to look into this.

All a work in progress...


was (Author: tallison@mitre.org):
[~tilman], thank you, again, for all of your work on this.

Tika community, if you have a chance, take a look at the attached comparison file and recommend
other statistics that would be useful for file comparison (TIKA-1332) and junk detection TIKA-1443).

I added the following columns:
language id: language and confidence score
top10words
count of the top 10 words that are stopwords in English (based on Lucene's StandardAnalyzer's
list)...I need to make this language specific...if the langid component says "so", we need
to count the number of so stopwords.

I renamed some of the column headers.  I finally had a chance to break out Manning and Schutze...
"token overlap" is actually Dice coefficient.

I added a vlookup column for [~tilman]'s notes. 

I cannot figure out why I'm getting different lang id confidence scores for a given file pair
if the Dice Coefficient is 1.0.  I need to look into this.

All a work in progress...

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon
> as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to
> carry on the discussion of regression testing (if any further discussion is necessary) or
> any other prep that needs to happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
