tika-dev mailing list archives

From "Andrew Jackson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
Date Mon, 20 Oct 2014 14:39:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176934#comment-14176934 ]

Andrew Jackson commented on TIKA-1302:
--------------------------------------

I have 2,358,167 errors from one collection (2 billion resources), but the majority are SAXParseExceptions.
The collection is made up of UK web archive content from 1996-2010, so there's lots of broken HTML/XML
in there. If I strip out the SAXParseExceptions, there are just 317,548 miscellaneous errors,
which are perhaps more interesting.

Here's an example including the SAX exceptions:
{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100713041445,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=2737187,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20091017141202,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=34830/crti=4/hotel-pictures,"org.xml.sax.SAXParseException: Open quote is expected for attribute ""ID"" associated with an  element type  ""COMMENT""."
20091017143741,http://www.madfun.co.uk:80/-10?ref=31,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20061020021825,http://reservations.talkingcities.co.uk:80/nexres/hotels/map_hotels.cgi?hid=10055548&map_only=yes&type=overview,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20061020022224,http://www.ravensportal.co.uk:80/forum/index.php?PHPSESSID=1688184d9bb881cfc73600b1670ecaf5&amp;type=rss;action=.xml,org.xml.sax.SAXParseException: The character reference must end with the ';' delimiter.
20101227142905,http://www.etc-online.co.uk:80/style4.asp?pn=courses&sn=26,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20060926015856,http://www.qca.org.uk/4412.html,"org.xml.sax.SAXParseException: The entity ""nbsp"" was referenced\, but not declared."
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,java.lang.ArrayIndexOutOfBoundsException: -1
20030124193820,http://www.mgcars.org.uk:80/cgi-bin/gen5?runprog=porter&cov=&mode=buy&o=4854130936&code=9123&cu=&,"org.xml.sax.SAXParseException: The element type ""META"" must be terminated by the matching end-tag ""</META>""."
20100121205831,http://www.epupz.co.uk:80/clas/viewdetails.asp?view=307389,org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
{code}
...and for the others...
{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100928070438,http://redtyger.co.uk/discuss/projectexternal.php,7524,application/rss+xml,java.lang.NullPointerException: null
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
20060303154606,http://www.dfes.gov.uk:80/rsgateway/DB/SFR/s000286/sfr37-2001.doc,562004,application/msword,java.lang.IllegalArgumentException: Position 698368 past the end of the file
20041225033311,http://members.lycos.co.uk:80/worldofradio/distance.pdf,57891,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,"java.io.IOException: Error: Expected a long type\, actual='25#0/'"
20041121095849,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/SER2549.pdf,157148,application/pdf,java.util.zip.DataFormatException: oversubscribed literal/length tree
20041121100005,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/MSV_Foreword.pdf,12773,application/pdf,java.util.zip.DataFormatException: oversubscribed dynamic bit lengths tree
20060925090249,http://www2.rgu.ac.uk/library_edocs/resource/exam/0405engineering/EN3581%20OFFSHORE%20ENGINEERING.pdf,1684742,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20060925091406,http://www2.rgu.ac.uk/library_edocs/resource/exam/0304engineering/EE31060304s1.pdf,149238,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20040612212128,http://www.swhst.org.uk:80/Linked%20Files/spr%20contact%20addresses.xls,23040,application/vnd.ms-excel,org.apache.poi.EncryptedDocumentException: Default password is invalid for docId/saltData/saltHash
20051111183952,http://freeweb.co.uk:80/show_nw.php?ref=258&target=B&show=aff&PHPSESSID=a150a130c58fcea048866fb965ef7dfb,232436,text/html; charset=iso-8859-1,org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
20071025140555,http://www.honleyhigh.kirklees.sch.uk/MFL/MFL_Links/PowerPoint%20Presentations/German/Geryear-9-future-tense.ppt,2664960,application/vnd.ms-powerpoint,"org.apache.poi.hslf.exceptions.OldPowerPointFormatException: Based on the Current User stream\, you seem to have supplied a PowerPoint95 file\, which isn't supported"
20071207004337,http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt,155136,application/vnd.ms-powerpoint,java.lang.ArrayIndexOutOfBoundsException: 20
{code}

The first two columns identify the item. The next two are the size of the item in bytes and
the result of using Tika to identify the format (.detect only, no parse). The last column
contains the first line of the parse exception(s).
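
For anyone wanting to slice listings like these by error type, here is a minimal Python sketch that tallies rows by exception class and separates out the SAXParseExceptions. It assumes each record has been put back on a single line (the mail wrapping undone), and it takes the last field as the parse error, because the rows in the first listing carry only three fields despite the five-column header. The names SAMPLE, exception_class and load_errors are illustrative only and are not part of Tika or of the original tooling.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: rows copied from the listings above, with the mail
# line-wrapping undone so that each record sits on one line.
SAMPLE = r'''wayback_date,url,content_length,content_type_tika,parse_error
20060926015856,http://www.qca.org.uk/4412.html,"org.xml.sax.SAXParseException: The entity ""nbsp"" was referenced\, but not declared."
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,"java.io.IOException: Error: Expected a long type\, actual='25#0/'"
20071207004337,http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt,155136,application/vnd.ms-powerpoint,java.lang.ArrayIndexOutOfBoundsException: 20
'''

SAX = "org.xml.sax.SAXParseException"


def exception_class(parse_error):
    """Return the text before the first ':' (the exception class name)."""
    return parse_error.split(":", 1)[0].strip()


def load_errors(text):
    """Yield (wayback_date, url, parse_error) for each non-empty data row."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    for row in reader:
        if row:
            # Assumption: the parse error is always the last field, whether
            # the row has three fields or five.
            yield row[0], row[1], row[-1]


errors = list(load_errors(SAMPLE))
counts = Counter(exception_class(e) for _, _, e in errors)
non_sax = [rec for rec in errors if exception_class(rec[2]) != SAX]

for name, n in counts.most_common():
    print(n, name)
```

The csv module handles the quoted fields and doubled quotes; the backslash before a comma inside a quoted field is left in place as-is.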

Note that you can download the original items from the Internet Archive using
this template: 
{code:none}
http://web.archive.org/web/{wayback_date}/{url}
{code}
e.g. for the last exception listed above, you can download the item at: http://web.archive.org/web/20071207004337/http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt
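
Filling in that template for a whole listing is a one-liner per row. A minimal Python sketch, where wayback_url is a hypothetical helper name and the URL is inserted unmodified, exactly as it appears in the url column:

```python
def wayback_url(wayback_date, url):
    """Fill in the http://web.archive.org/web/{wayback_date}/{url} template."""
    return "http://web.archive.org/web/{}/{}".format(wayback_date, url)


# The last row of the second listing above.
link = wayback_url(
    "20071207004337",
    "http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt",
)
print(link)
```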

It might take me a while to generate the full output for the 2.3 million, so I'll try to pull
out the 300 thousand other errors first. Our Solr index is having some performance issues,
so it might be a bit slow.

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and running again,
> it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
