tika-dev mailing list archives

From Antoni Mylka <antoni.my...@gmail.com>
Subject TikaMimeTypeIdentifier in Aperture
Date Fri, 03 Dec 2010 00:07:03 GMT
Hello Aperture

(cc tika-dev, may be interesting for you too)

As you know, Tika has made certain advances in the field of mime type
identification, which we (Aperture) have wanted to implement for a long
time. This is feature request 3043080, but it also applies to bug
3025427 and to feature requests 2210328 (ZipContainerDetector), 1838840
and 1650532 (root-XML-based detection). The oldest is almost 4 years
old.

That's why I decided to explore the idea of an implementation of the
Aperture MimeTypeIdentifier interface, which would delegate the actual
identification to Tika's ContainerAwareDetector backed by the Tika
MimeTypes class. I worked on it in aperture-addons and have now moved
it to aperture-core, to be included in the next release.
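
To illustrate the delegation idea, here is a minimal sketch. It is not
the actual TikaMimeTypeIdentifier code: StreamDetector and
SimpleMimeTypeIdentifier are hypothetical, simplified stand-ins for the
Tika detector and the Aperture interface, declared locally so that the
example compiles with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a Tika-style detector (content stream plus optional name). */
interface StreamDetector {
    String detect(InputStream content, String resourceName) throws IOException;
}

/** Hypothetical, simplified stand-in for Aperture's MimeTypeIdentifier. */
interface SimpleMimeTypeIdentifier {
    String identify(byte[] firstBytes, String fileName);
}

/** Adapter: offers the identifier-style API and delegates the real work to the detector. */
final class DelegatingIdentifier implements SimpleMimeTypeIdentifier {
    private final StreamDetector detector;

    DelegatingIdentifier(StreamDetector detector) {
        this.detector = detector;
    }

    @Override
    public String identify(byte[] firstBytes, String fileName) {
        // The caller hands over only a prefix of the file; wrap it in a stream.
        byte[] bytes = (firstBytes == null) ? new byte[0] : firstBytes;
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return detector.detect(in, fileName);
        } catch (IOException e) {
            // Assumption of this sketch: a failed detection is reported as "unknown" (null).
            return null;
        }
    }
}
```

Any detector can be plugged in, e.g. a lambda
`(in, name) -> in.read() == '%' ? "application/pdf" : "application/octet-stream"`
for a quick test.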

This turned out to be (much) more complex than I thought. There were
certain files which Tika recognized better, and certain that Aperture
recognized better. I submitted 7 issues to Tika JIRA and prepared a
little hack that allowed me to augment the tika-mimetypes.xml with the
knowledge from our mimetypes.xml file. As of now the only things that
the MagicMimeTypeIdentifier does better than TikaMimeTypeIdentifier
are:

- support for string patterns in UTF-16 documents, e.g. Tika can't
recognize XML or HTML in a full UTF-16 file
- support for allowsWhiteSpace before a pattern, e.g. Tika had
problems recognizing the <html> tag if there is some whitespace in
front of it (now it works around that limitation in a good enough way
though, so it's actually not a problem)
- support for multiple parent types.
   - quattro pro 6 used a wordperfect magic, while later ones used
office magics,
   - older Corel Presentations used wordperfect magic, newer use office,
   - works spreadsheets 3.0 used a wordperfect magic, 4.0 used their
own format, 7.0 uses office
   The problem with Tika is that it treats all those cases correctly
when only the name is provided, but when both name and bytes are
provided, the byte-based mime type trumps the name-based mime type,
because the name-based type is not a specialization of the byte-based
one (a type can only have a single parent, so if we say that office is
the parent of works, we won't recognize works 3.0 and 4.0 but only 7.0).
- getExtensionsFor(String mimeType), useful in many apps; in Tika the
mime knowledge base is hidden in private fields and
package-protected classes
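
The multiple-parent problem can be shown with a toy model. This is a
sketch of the resolution rule as described above, not Tika's or
Aperture's code; the class TypeHierarchy and the type labels ("works",
"office", "wordperfect") are placeholders made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy type hierarchy in which a type may have any number of parents. */
final class TypeHierarchy {
    private final Map<String, Set<String>> parents = new HashMap<>();

    void addParent(String type, String parent) {
        parents.computeIfAbsent(type, k -> new LinkedHashSet<>()).add(parent);
    }

    /** True if 'type' equals 'ancestor' or reaches it through parent links. */
    boolean isSpecializationOf(String type, String ancestor) {
        Deque<String> todo = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        todo.push(type);
        while (!todo.isEmpty()) {
            String current = todo.pop();
            if (current.equals(ancestor)) {
                return true;
            }
            if (seen.add(current)) {
                for (String p : parents.getOrDefault(current, Collections.<String>emptySet())) {
                    todo.push(p);
                }
            }
        }
        return false;
    }

    /** The name-based type wins only if it specializes the byte-based type. */
    String resolve(String byName, String byBytes) {
        if (byBytes == null) {
            return byName;
        }
        if (byName != null && isSpecializationOf(byName, byBytes)) {
            return byName;
        }
        return byBytes;
    }
}
```

With only "works -> office" registered, a works file carrying the
wordperfect magic resolves to "wordperfect". Once "works -> wordperfect"
is registered as a second parent, the same call resolves to "works".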

Yet apart from these minor inconveniences, all of which will probably
disappear in the near future, Tika brings benefits:
- more mime type descriptions,
- "correct" names, either IANA-approved, or "proper" vendor-made
starting with "vnd." or "invented" ones starting with "x-"
- detection based on root XML element (at last we can correctly detect
XHTML docs with <?xml version="1.0" encoding="utf-8"?> header)
- better detection of OOXML and OLE docs without a name (thanks to
ZipContainerDetector and PoiContainerDetector), though only slightly,
the ContainerAwareDetector works best with a full file, but we give it
only the first 8KB
- better plaintext detection, and a couple of other improvements
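
To show why root-element detection still works on a truncated prefix,
here is a minimal sketch based on the JDK's StAX parser. It is not how
Tika implements it; the class RootXmlSniffer and the fallback type
names it returns are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Guesses a mime type from the root element found in the first bytes of a file. */
final class RootXmlSniffer {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Returns "{namespace}localName" of the root element, or null if none can be parsed. */
    static String rootElement(byte[] prefix) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or fetch external entities while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(new ByteArrayInputStream(prefix));
            // Streaming parse: stop at the first start tag, so a truncated tail is never read.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String ns = reader.getNamespaceURI();
                    return "{" + (ns == null ? "" : ns) + "}" + reader.getLocalName();
                }
            }
            return null;
        } catch (XMLStreamException e) {
            return null; // not XML, or broken before the root element
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // nothing useful to do while sniffing
                }
            }
        }
    }

    static String detect(byte[] prefix) {
        String root = rootElement(prefix);
        if (root == null) {
            return "application/octet-stream";
        }
        if (root.equals("{" + XHTML_NS + "}html")) {
            return "application/xhtml+xml";
        }
        return "application/xml";
    }
}
```

Because the parser stops at the first start tag, an XHTML document cut
off in the middle of its head is still reported as XHTML.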

I made TikaMimeTypeIdentifier the default choice in ApertureRuntime
and in Aperture's Example Application. Existing apps, which use the
MagicMimeTypeIdentifier will not see any difference, though their
authors are advised to take a look at the new implementation. The new
MimeTypeIdentifier uses different names for many mime types. In most
cases these different names are "better", yet they are different and
might require a modification of the client code.

Fixing the four limitations outlined above will require additional
patches to Tika. I wanted to "release" the code now, to allow for
testing, before the next Aperture release. In the long term, I think
that maintaining two separate mime type identifiers is a bad idea.

So, play with the ApertureRuntime, or the CLI apps, and try to
substitute "new MagicMimeTypeIdentifier()" with "new
TikaMimeTypeIdentifier()" and see what happens.

Links:

The file with the mime type info that was present in Aperture's
mimetypes.xml, but not in tika-mimetypes.xml:
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/main/resources/org/semanticdesktop/aperture/tika/diff-mimetypes.xml

A diff between the following two files shows the differences in mime
type identification.
Aperture identification (name, identification by data, identification
by name and data):
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/test/java/org/semanticdesktop/aperture/mime/identifier/magic/ApertureDocumentsIdentificationTest.java
Tika-based identification (only 8KB of each file is taken into
account, tika-mimetypes.xml is enhanced via MimeTypesEnhancer with the
content of diff-mimetypes.xml)
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/test/java/org/semanticdesktop/aperture/tika/TikaMimeTypeIdentifierTest.java

--
Antoni Myłka
antoni.mylka@gmail.com
