tika-dev mailing list archives

From "Andrew Jackson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly
Date Tue, 21 Oct 2014 12:59:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361 ]

Andrew Jackson edited comment on TIKA-1302 at 10/21/14 12:59 PM:
-----------------------------------------------------------------

Okay, so the c.300,000 exceptions are here: https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0
- let me know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

This conversation has helped me spot a gap in our code. We currently call Tika.detect() before
Tika.parse(), and only run the latter if the former succeeds. Sadly, the version of the code
that I used to generate this data recorded the Tika exception for the .parse() step only, not
for the .detect() step. This explains why there are no hung-thread events in this result set:
the interrupted .detect() calls were not recorded. We'll be re-running this scan soonish, so
I'll make sure the next version records all the exceptions. IIRC, from looking at the MIME
types, the permanent hangs were mostly ZIPs, Office documents, and maybe some PDFs.
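
As a rough illustration of the gap described above, here is a minimal Java sketch of a
detect-then-parse loop that records a failure or timeout from either step. It deliberately
does not call Tika: the `detect` and `parse` arguments are hypothetical stand-ins for
Tika.detect() and Tika.parse(), and `StageResult`, `DetectThenParse`, `timed` and `process`
are names invented for this example, not the code used for the scan.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical per-resource record; not part of Tika. */
final class StageResult {
    final String stage;       // "detect" or "parse" on failure, "ok" on success
    final String contentType; // null if detection itself failed
    final String error;       // null on success

    StageResult(String stage, String contentType, String error) {
        this.stage = stage;
        this.contentType = contentType;
        this.error = error;
    }
}

final class DetectThenParse {
    /** Runs one step with a time limit; rethrows its failure or a TimeoutException. */
    static <T> T timed(ExecutorService pool, Callable<T> step, long timeoutMs) throws Exception {
        Future<T> future = pool.submit(step);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the hung step
            throw e;
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause; // surface the step's own exception
            }
            throw e;
        }
    }

    /** Records which stage failed, so a hung or failed detect is not silently lost. */
    static StageResult process(ExecutorService pool, Callable<String> detect,
                               Callable<String> parse, long timeoutMs) {
        String type;
        try {
            type = timed(pool, detect, timeoutMs);
        } catch (Exception e) {
            return new StageResult("detect", null, e.toString());
        }
        try {
            timed(pool, parse, timeoutMs);
        } catch (Exception e) {
            return new StageResult("parse", type, e.toString());
        }
        return new StageResult("ok", type, null);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // Stand-in for a detect() call that hangs longer than the limit.
            StageResult r = process(pool,
                    () -> { Thread.sleep(5000); return "application/zip"; },
                    () -> "extracted text",
                    200);
            System.out.println(r.stage + " " + r.contentType + " " + r.error);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The point of the design is simply that both steps go through the same timed wrapper, so a
timeout during detection ends up in the results with its stage name attached.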

Note that the CSV includes the Content-Type from the .detect() step, and this should indicate
which module was run on the resource (i.e. whatever the Tika 1.5 mapping was for that MIME
type). I don't think we changed the parse configuration significantly, so HTML, XHTML and XML
should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to
check).

I'm not sure it's worth giving you all the SAX exceptions, as there are a lot of repeats of
the same problems. I think a random sample of about 50,000 should be plenty. Does that sound
okay to you?
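
For what it's worth, one possible way to draw a fixed-size random sample from a long list of
exception records without loading them all is reservoir sampling. The sketch below is only an
illustration under that assumption, not a description of how the sample will actually be
produced; `ReservoirSample` and `sample` are names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

final class ReservoirSample {
    /**
     * Returns up to k items chosen uniformly at random from the iterator
     * (classic "Algorithm R"). Uses an int counter, so it assumes fewer
     * than Integer.MAX_VALUE items.
     */
    static <T> List<T> sample(Iterator<T> items, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        while (items.hasNext()) {
            T item = items.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Pick a slot uniformly in [0, seen); the new item is kept
                // only if that slot falls inside the reservoir.
                int j = rng.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Made-up records standing in for rows of the exceptions CSV.
        List<String> rows = Arrays.asList("row-a", "row-b", "row-c", "row-d", "row-e");
        System.out.println(sample(rows.iterator(), 2, new Random()));
    }
}
```

In practice the iterator would walk the CSV rows once, with k set to roughly 50,000.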

EDIT: Oh, and I meant to say, I'm glad to hear about [~gostep] and [~tallison@apache.org]'s
efforts to run this on GovDocs, and would be interested in comparing results. We already publish
format profile data about web archives, and would love to have more data to refer to.


> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and running again,
> it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
