tika-dev mailing list archives

From Alex Ott <alex...@gmail.com>
Subject Re: Detecting container formats
Date Tue, 15 Jun 2010 19:45:43 GMT
Hello

Ken Krugler at "Tue, 15 Jun 2010 11:56:51 -0700" wrote:
 KK> I think this is a reasonable approach, as long as (per Alex's suggestion) it's
 KK> configurable in various ways.

 KK> E.g. if you know you don't want to parse OLE2-based files, so you've removed jars
 KK> for those parsers, then it would be great to have an easy way of disabling the (more
 KK> expensive) mime-type detection, and potentially avoid the dependency on these same
 KK> jars.

 KK> Separately, I think this issue might also trigger improvements to the existing "magic
 KK> bytes" detection code in Tika. IIRC, we wound up adding full regex with some additional
 KK> matching rules in Krugle, to extend the (from Nutch, same as Tika) mime-type detection
 KK> code to better handle things like source code files. I imagine something similar
 KK> might be needed to reliably handle container matching.

I'm not sure that Tika needs full regex support. In most mime-type
detection tasks it's enough (from my experience in this branch) to have
only a search function and a dynamic addressing function (for example, find
the Zip signature somewhere, and then use a mix of getByte(offset) calls to
check other values).
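
A rough sketch of that idea in Java, to make it concrete. This is not Tika
code: the class and method names (ContainerSniffer, find, getUInt16LE,
hasStoredMimetypeEntry) and the 1024-byte search window are made up for
illustration. It looks for the Zip local file header signature and then reads
fields at offsets relative to the match, here to check whether the entry is an
uncompressed one named "mimetype" (as ODF and EPUB packages place first).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class ContainerSniffer {
    // Zip local file header signature: "PK\003\004".
    private static final byte[] ZIP_SIG = {0x50, 0x4B, 0x03, 0x04};

    /** Search function: index of the first match within the first `limit` bytes, or -1. */
    static int find(byte[] data, byte[] pattern, int limit) {
        int end = Math.min(data.length, limit) - pattern.length;
        for (int i = 0; i <= end; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    /** Addressing function: unsigned byte at `offset`, or -1 when out of range. */
    static int getByte(byte[] data, int offset) {
        return (offset >= 0 && offset < data.length) ? (data[offset] & 0xFF) : -1;
    }

    /** Little-endian unsigned 16-bit value built from two getByte calls, or -1. */
    static int getUInt16LE(byte[] data, int offset) {
        int lo = getByte(data, offset);
        int hi = getByte(data, offset + 1);
        return (lo < 0 || hi < 0) ? -1 : (lo | (hi << 8));
    }

    /** True if the first local header found describes a stored entry named "mimetype". */
    static boolean hasStoredMimetypeEntry(byte[] data) {
        int base = find(data, ZIP_SIG, 1024);       // assumed search window
        if (base < 0) {
            return false;
        }
        int method = getUInt16LE(data, base + 8);   // compression method, 0 = stored
        int nameLen = getUInt16LE(data, base + 26); // file name length
        if (method != 0 || nameLen != 8 || base + 30 + nameLen > data.length) {
            return false;
        }
        // The file name starts right after the 30-byte fixed header.
        String name = new String(data, base + 30, nameLen, StandardCharsets.US_ASCII);
        return name.equals("mimetype");
    }
}
```

Everything after the search is plain offset arithmetic from the match position,
so no regex engine is involved.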

For source code it's better to use something like naive Bayes. It worked
well, as I remember from tests that we made 6 years ago...
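
To show what I mean, here is a minimal multinomial naive Bayes sketch over
source tokens, with add-one smoothing. It is not the code from those tests and
not anything in Tika; the class name, the tokenizer regex and the training
snippets are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical classifier, for illustration only.
class NaiveBayesLanguageGuesser {
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> totalTokens = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Assumed tokenizer: runs of letters, digits, '_' and '#'. */
    private static String[] tokenize(String text) {
        return text.split("[^A-Za-z0-9_#]+");
    }

    /** Adds one labelled training document. */
    void train(String label, String text) {
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
            totalTokens.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    /** Returns the label with the highest log-posterior, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : tokenCounts.keySet()) {
            // Log prior from the share of training documents with this label.
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int total = totalTokens.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) {
                    continue;
                }
                // Add-one (Laplace) smoothing so unseen tokens do not zero the score.
                int count = counts.getOrDefault(token, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Usage would be one train(label, text) call per sample file and then
classify(text) on the unknown one. A real setup would need far more training
data than a couple of snippets.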

-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott
