tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly
Date Mon, 20 Oct 2014 17:23:35 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177054#comment-14177054 ]

Tim Allison edited comment on TIKA-1302 at 10/20/14 5:22 PM:
-------------------------------------------------------------

That would be a fantastic resource.  Thank you for sharing!  We could do a bit of munging
to prioritize the most common exceptions in dependencies.
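
As a rough illustration of that kind of munging, here is a minimal sketch that tallies exception class names found in log lines and ranks them by frequency. The class name ExceptionTally, the method tally, and the one-line-per-document log format are all hypothetical; nothing here is part of Tika, and a real run would need to match whatever the batch job actually logs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: ranks exception classes by frequency.
class ExceptionTally {
    // Matches a fully qualified class name ending in "Exception" or "Error",
    // e.g. "org.xml.sax.SAXParseException".
    private static final Pattern EXCEPTION_CLASS = Pattern.compile(
            "\\b((?:[a-z_][a-z0-9_]*\\.)+[A-Z][A-Za-z0-9_$]*(?:Exception|Error))\\b");

    // Counts the first exception class named on each line and returns the
    // counts ordered from most to least frequent (ties stay alphabetical).
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = EXCEPTION_CLASS.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts keep the TreeMap's key order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Map<String, Integer> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            ranked.put(e.getKey(), e.getValue());
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Assumed log format: file name, a tab, then the message (an assumption).
        List<String> sample = new ArrayList<>();
        sample.add("doc1.pdf\torg.xml.sax.SAXParseException: Content is not allowed in prolog.");
        sample.add("doc2.doc\tjava.io.IOException: Unexpected EOF");
        sample.add("doc3.xml\tCaused by: org.xml.sax.SAXParseException; lineNumber: 1");
        sample.add("doc4.txt\tOK");
        tally(sample).forEach((name, count) -> System.out.println(count + "\t" + name));
    }
}
```

Because the match keeps the package name, the same counts could be grouped by package prefix to see which dependency each exception comes from.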

Your 0.1% exception rate is smaller than the 0.7% exception rate I'm finding on the govdocs1
corpus, but in the same ballpark.  Interesting.

Do you know how many permanent hangs you had, and can you identify those files easily enough?
I had about 6 in the govdocs1 corpus.

Thank you!

P.S. On the SAXParseExceptions: did those come from the XMLParser or from the HtmlParser?
I recently discovered that we hardcode an override in TikaResource within tika-server:
{noformat}
 parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
{noformat}

I'm not sure that we should hardcode that, but it does make sense to use that configuration!



> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and running again,
> it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)