tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
Date Mon, 20 Oct 2014 16:12:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177054#comment-14177054 ]

Tim Allison commented on TIKA-1302:
-----------------------------------

That would be a fantastic resource.  Thank you for sharing!  We could do a bit of munging
to prioritize the most common exceptions thrown in dependencies.
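
A minimal sketch of what that munging might look like, assuming the shared data
can be reduced to one Java stack trace string per failed document (that input
shape, the function names, and the "skip frames inside org.apache.tika" rule are
all assumptions for illustration, not anything from the actual dataset):

```python
import re
from collections import Counter

# Matches a Java stack frame line such as
#   "\tat org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:1)"
# and captures the fully qualified class name (without the method).
FRAME = re.compile(r"^\s*at\s+([\w.$]+)\.[\w$<>]+\(")


def top_frame_class(trace):
    """Return the class of the first 'at ...' frame, or None if there is none."""
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match:
            return match.group(1)
    return None


def exception_name(trace):
    """Return the exception class named on the first line of the trace."""
    lines = trace.strip().splitlines()
    if not lines:
        return ""
    return lines[0].split(":", 1)[0].strip()


def rank_exceptions(traces, skip_prefix="org.apache.tika."):
    """Count (exception, throwing class) pairs, most common first.

    Assumption: a trace "belongs to a dependency" when its top frame is
    outside skip_prefix. Traces thrown from Tika itself are ignored.
    """
    counts = Counter()
    for trace in traces:
        cls = top_frame_class(trace)
        if cls is None or cls.startswith(skip_prefix):
            continue
        counts[(exception_name(trace), cls)] += 1
    return counts.most_common()
```

The ranked list would give a rough order in which to report problems to the
upstream projects; grouping by top frame is only one possible heuristic.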

Your 0.1% exception rate is lower than the 0.7% exception rate I'm finding on the govdocs1
corpus, but in the same ballpark.  Interesting.

Do you know how many permanent hangs you had, and can you identify those files easily enough?
I had about 6 in the govdocs1 corpus.

Thank you!

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and running again,
> it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
