tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
Date Thu, 23 Oct 2014 00:18:33 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180783#comment-14180783 ]

Tim Allison edited comment on TIKA-1442 at 10/23/14 12:18 AM:
--------------------------------------------------------------

Top10Words: top 10 most frequent tokens
NumTop10EnStopWords: of the top 10 most frequent tokens, how many are English stopwords

As above, if NumTop10EnStopWords proves to be of any use, we'll want to add stopwords for
other languages and calculate the number of stopwords _for that language_ that are in the
top 10 most frequent tokens.
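
For illustration only, here is a minimal Python sketch of how the two statistics could be computed, with the stopword list chosen per language. This is not the code behind the attached spreadsheets: the regex tokenizer is a crude stand-in for the Lucene analyzer, the stopword sets are tiny made-up samples, and the names STOPWORDS, tokenize, top10_words and num_top10_stopwords are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword lists keyed by language code.
# Assumption: real lists would be much longer and loaded from files.
STOPWORDS = {
    "en": {"the", "of", "and", "a", "in", "to", "is"},
    "de": {"der", "die", "das", "und", "in", "zu", "ist"},
}

def tokenize(text):
    # Crude stand-in for an analyzer: lowercase, then keep runs of word characters.
    return re.findall(r"\w+", text.lower())

def top10_words(text):
    # Top10Words: the 10 most frequent tokens with their counts.
    return Counter(tokenize(text)).most_common(10)

def num_top10_stopwords(text, lang):
    # Of the top 10 most frequent tokens, how many are stopwords for `lang`.
    stop = STOPWORDS[lang]
    return sum(1 for word, _count in top10_words(text) if word in stop)

sample = "the cat and the dog and the bird of the field"
en_hits = num_top10_stopwords(sample, "en")
de_hits = num_top10_stopwords(sample, "de")
```

A count near zero for the detected language would then be a hint that the extracted text is not real text in that language.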

My German is rusty, but these don't ring many bells:
{noformat}
znk: 42 | ul: 31 | 2: 30 | gtj: 28 | k: 26 | 4: 19 | ugq: 19 | 6: 19 | 7: 17 | yvkioky: 17
{noformat}

On a side note, I figured out how a pair of docs can have a perfect Dice coefficient but
differing lang id confidence scores: the Dice coefficient is calculated on the tokens produced
by Lucene's ICUTokenizer+ICUFoldingFilter, whereas the lang id score is calculated on the
string itself, before analysis.  I suspect that for those doc pairs with a lower lang id score,
there will be more junk that was "cleaned" out by the Analyzer.
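
A small Python sketch of that effect, under stated assumptions: analyze() is only a rough approximation of ICUTokenizer+ICUFoldingFilter (accent stripping, case folding, word tokens), and dice() uses the set-based form 2|A∩B| / (|A|+|B|), which may not be exactly what the comparison code computes. Both function names and the sample strings are hypothetical.

```python
import re
import unicodedata

def analyze(text):
    # Rough approximation of tokenizing plus folding:
    # decompose, drop combining marks (accents), casefold, keep word tokens.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", folded.casefold())

def dice(tokens_a, tokens_b):
    # Set-based Dice coefficient: 2 * |A & B| / (|A| + |B|).
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty token sets are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))

doc_a = "Das Gerät ist neu."
doc_b = "## Das **Gerat** ist ... neu ~~"

score = dice(analyze(doc_a), analyze(doc_b))
strings_differ = doc_a != doc_b
```

Here the two strings analyze to the same tokens, so the Dice coefficient is perfect, yet anything that scores the strings directly still sees the extra symbols in doc_b.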


was (Author: tallison@mitre.org):
Top10Words: top 10 most frequent tokens
NumTop10EnStopWords: of the top 10 most frequent tokens, how many are English stopwords

As above, if NumTop10EnStopWords proves to be of any use, we'll want to add stopwords for
other languages and calculate the number of stop words _for that language_ that are in the
top 10 most frequent.

On a side note, I figured out how a pair of docs can have a perfect Dice coefficient but have
differing lang id confidence scores:  the Dice coefficient is calculated on tokens identified
by Lucene's ICUTokenizer+ICUFoldingFilter; whereas the lang id score is calculated based on
the string.  I suspect that for those doc pairs with a lower lang id score, there will be
more junk that was "cleaned" out by the Analyzer.

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon
> as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to
> carry on the discussion of regression testing (if any further discussion is necessary) or
> any other prep that needs to happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
