tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1443) Add a junk text detector to Tika
Date Tue, 14 Oct 2014 03:33:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170462#comment-14170462 ]

Chris A. Mattmann commented on TIKA-1443:


> Add a junk text detector to Tika
> --------------------------------
>                 Key: TIKA-1443
>                 URL: https://issues.apache.org/jira/browse/TIKA-1443
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Minor
> It would be helpful to have a detector that flags documents whose extracted text is junk.
> This could be used as a component of TIKA-1332 or as a standalone detector. See TIKA-1332
> for some initial ideas of what statistics we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether changes in code lead to less "junky" documents
> or more "junky" documents. This would also aid in prioritizing manual review of output comparison
> (see discussion in TIKA-1419).
> * Search system integrators could use that information to set document-specific relevancy
> rankings or to avoid indexing a document.

This message was sent by Atlassian JIRA
