tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject The case of the unexpected error
Date Wed, 16 Dec 2009 00:31:27 GMT
I'd been running a large web crawl in EC2, using a Hadoop job jar  
where I'd excluded all of the support jars used for Microsoft formats.  
This dramatically reduced the size of the job jar that I needed to  
constantly push to EC2 via a relatively slow DSL connection.

During the crawl, I ignored all responses that didn't have a mime-type  
of text/plain or one of the three HTML mime-types.

But I ran into a problem: the Tika auto-detect code was correctly
identifying a file as being a Microsoft format, even though the
server said it was text/plain. The Tika Microsoft parser would then
try to dynamically figure out which support code to call, and in the
end it threw a NoSuchMethodError.

Note that this is an Error, not an Exception. As such, it flies on  
past all of the Tika catch blocks, and my own code's catch blocks, and  
kills the Hadoop job in weird and wonderful ways.
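
A minimal sketch of why the catch blocks miss it, assuming a simulated
failure rather than the real Tika call path. The class and method names
below are hypothetical; only the standard java.lang types are real.
NoSuchMethodError descends from Error, not Exception, so a
catch (Exception e) block never sees it.

```java
public class ErrorEscapesCatch {

    // Hypothetical stand-in for a parser whose support jar is missing.
    // The real failure came from inside the Tika Microsoft parser.
    static void parseWithMissingSupportCode() {
        throw new NoSuchMethodError("simulated missing support method");
    }

    // Mirrors a typical catch block: it handles Exception only,
    // so the NoSuchMethodError passes straight through to the caller.
    static String parseCatchingException() {
        try {
            parseWithMissingSupportCode();
            return "parsed";
        } catch (Exception e) {
            return "handled: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(parseCatchingException());
        } catch (Throwable t) {
            // Only a Throwable (or Error) handler sees the failure.
            System.out.println("escaped: " + t.getClass().getSimpleName());
        }
    }
}
```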

It seems like Errors shouldn't be thrown for situations where dynamic  
configuration could result in a class not existing, but before I  
started writing up an issue I wanted to get input from the community  
about this. It's a bit gray to me, since I essentially "did it to  
myself" by excluding jars.
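
One possible shape for what such an issue could ask for, sketched under
assumptions: catch LinkageError (the common superclass of
NoSuchMethodError and NoClassDefFoundError) at the point where support
code is looked up dynamically, and rethrow it as a checked exception.
LinkageGuard, MissingSupportCodeException, and runGuarded are
hypothetical names for illustration and are not part of Tika.

```java
public class LinkageGuard {

    // Hypothetical checked exception; not part of Tika.
    static class MissingSupportCodeException extends Exception {
        MissingSupportCodeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Runs one parser step and converts linkage failures (missing class
    // or method at runtime) into the checked exception above.
    static void runGuarded(Runnable parserStep) throws MissingSupportCodeException {
        try {
            parserStep.run();
        } catch (LinkageError e) {
            throw new MissingSupportCodeException(
                    "support code unavailable: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulated failure standing in for the missing support jar.
            runGuarded(() -> { throw new NoSuchMethodError("simulated"); });
        } catch (Exception e) {
            // An ordinary catch (Exception e) block now sees the problem.
            System.out.println("caught as Exception: " + e.getMessage());
        }
    }
}
```

With a wrapper like this, existing catch (Exception e) blocks in the
caller would handle the missing-jar case like any other parse failure.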


-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g