tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2750) Update regression corpus
Date Fri, 05 Oct 2018 13:34:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639839#comment-16639839 ]

Tim Allison commented on TIKA-2750:

{{/data1}} includes: the original zips from Common Crawl contributed by [~jnioche], the zips
I downloaded from {{govdocs1}}, and scientific data from [~chrismattmann].  I propose running
{{rm -r}} on the original Common Crawl zips and the govdocs1 zips to free up space for sloshing
data around and/or fuzzing.

> Update regression corpus
> ------------------------
>                 Key: TIKA-2750
>                 URL: https://issues.apache.org/jira/browse/TIKA-2750
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
> I think we've had great success with the current data on our regression corpus.  I'd
> like to refresh some data from Common Crawl with three primary goals:
> 1) include more interesting documents (e.g. downsample English UTF-8 text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely more OOXML)
> 3) identify and re-fetch truncated documents from the original site -- Common Crawl truncates
> docs at 1 MB.  I think some truncated documents have been quite useful, similar to fuzzing,
> for identifying serious problems with some of our parsers.  However, it would be useful to
> have more complete files, esp. for PDFs.  In short, we should keep some truncated documents,
> but I'd also like to get more complete docs.
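
One possible way to approach goal 3 is to flag files whose size on disk equals the truncation cutoff. The sketch below is only an illustration, not something from the issue: the function name {{find_possibly_truncated}}, the constant, and the assumption that the cutoff is exactly 1 MiB (1,048,576 bytes) are all hypothetical. A file that happens to be exactly that size would be a false positive, so the result is a candidate list for re-fetching rather than a definitive one.

```python
from pathlib import Path

# Assumption (not stated in the issue): payloads were cut at exactly
# 1 MiB, i.e. 1,048,576 bytes. Adjust if the real cutoff differs.
ASSUMED_TRUNCATION_LIMIT = 1024 * 1024


def find_possibly_truncated(root, limit=ASSUMED_TRUNCATION_LIMIT):
    """Return files under `root` whose size equals the assumed cutoff.

    This is a heuristic: an exact size match suggests truncation,
    but does not prove it.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_size == limit
    )


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the directory of extracted documents.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_possibly_truncated(target):
        print(path)
```

The candidates could then be re-fetched from their original sites, while a sample of the truncated copies is kept for parser stress testing as described above.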

This message was sent by Atlassian JIRA
