tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Extensible content type detection
Date Sat, 17 Jan 2009 22:57:02 GMT

I've been thinking about how we currently do content type detection in
Tika and how we could improve things by making the type detection code
more modular and easier to extend. See TIKA-95 for some background.

I now think I have a pretty good idea on how to do this. See below for
a proposed Detector interface that's based on similar ideas as the
Parser interface that's worked really well for us. I would have
separate Detector classes for all the kinds of type detection
mechanisms we have (resource name, content type hint, magic bytes) and
may come up with in the future. In addition we'd have something like a
CompositeDetector class that delegates the detection task to
configured individual detectors and selects the most specific
resulting media type as the overall detection result.



Jukka Zitting

package org.apache.tika.detect;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeType;

/**
 * Content type detector. Implementations of this interface use various
 * heuristics to detect the content type of a document based on given
 * input metadata or the first few bytes of the document stream.
 *
 * @since Apache Tika 0.3
 */
public interface Detector {

    /**
     * Detects the content type of the given input document. Returns
     * <code>application/octet-stream</code> if the type of the document
     * cannot be detected.
     * <p>
     * If the document input stream is not available, then the first
     * argument may be <code>null</code>. Otherwise the detector may
     * read a bounded number of bytes from the start of the stream to help
     * in type detection. The stream must not be closed or otherwise
     * manipulated other than by simply reading bytes from it, as the caller
     * may use the mark feature to be able to reset the stream to the
     * beginning for proper parsing when the content type is detected.
     * For the same reason the detector must only read up to a limited
     * number of bytes from the stream to avoid potentially unbounded
     * memory use for the buffer of a marked stream.
     * <p>
     * The given input metadata is only read, not modified, by the detector.
     *
     * @param input document input stream, or <code>null</code>
     * @param metadata input metadata for the document
     * @return detected media type, or <code>application/octet-stream</code>
     * @throws IOException if the document input stream could not be read
     */
    MimeType detect(InputStream input, Metadata metadata) throws IOException;

}
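
To make the delegation idea concrete, here is a minimal, self-contained
sketch of a resource name detector, a magic byte detector and a composite
that keeps the most specific answer. It is not the proposed Tika code: to
stay runnable without Tika on the classpath it uses plain strings instead
of MimeType and a Map instead of Metadata. The "resourceName" key, the
MAX_PREFIX bound, the rank() ordering and the class names other than
Detector and CompositeDetector are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only. Media types are plain strings and metadata is a Map here;
// the proposal itself uses Tika's MimeType and Metadata classes.
public class DetectorSketch {

    static final String OCTET_STREAM = "application/octet-stream";
    static final int MAX_PREFIX = 8192; // hypothetical bound on bytes read

    // Simplified stand-in for the proposed interface.
    interface Detector {
        String detect(InputStream input, Map<String, String> metadata) throws IOException;
    }

    // Resource name detection, using a hypothetical "resourceName" key.
    static class NameDetector implements Detector {
        public String detect(InputStream input, Map<String, String> metadata) {
            String name = metadata.get("resourceName");
            if (name == null) return OCTET_STREAM;
            String lower = name.toLowerCase();
            if (lower.endsWith(".pdf")) return "application/pdf";
            if (lower.endsWith(".txt")) return "text/plain";
            return OCTET_STREAM;
        }
    }

    // Magic byte detection: only reads, and never more than five bytes.
    static class PdfMagicDetector implements Detector {
        private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

        public String detect(InputStream input, Map<String, String> metadata) throws IOException {
            if (input == null) return OCTET_STREAM;
            byte[] prefix = readPrefix(input, MAGIC.length);
            return Arrays.equals(prefix, MAGIC) ? "application/pdf" : OCTET_STREAM;
        }
    }

    // Delegates to each configured detector and keeps the most specific answer.
    static class CompositeDetector implements Detector {
        private final List<Detector> detectors;

        CompositeDetector(List<Detector> detectors) {
            this.detectors = detectors;
        }

        public String detect(InputStream input, Map<String, String> metadata) throws IOException {
            // Read a bounded prefix once so that every delegate sees the same bytes.
            byte[] prefix = input == null ? null : readPrefix(input, MAX_PREFIX);
            String best = OCTET_STREAM;
            for (Detector d : detectors) {
                InputStream view = prefix == null ? null : new ByteArrayInputStream(prefix);
                String type = d.detect(view, metadata);
                if (rank(type) > rank(best)) best = type; // ties keep the earlier result
            }
            return best;
        }
    }

    // Hypothetical specificity ranking, just enough for this example.
    static int rank(String type) {
        if (OCTET_STREAM.equals(type)) return 0;
        if ("text/plain".equals(type)) return 1;
        return 2;
    }

    // Reads up to max bytes, looping because read() may return fewer than asked.
    static byte[] readPrefix(InputStream in, int max) throws IOException {
        byte[] buf = new byte[max];
        int total = 0;
        while (total < max) {
            int n = in.read(buf, total, max - total);
            if (n < 0) break; // end of stream
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        Detector detector = new CompositeDetector(
                Arrays.asList(new NameDetector(), new PdfMagicDetector()));
        Map<String, String> metadata = new HashMap<>();
        metadata.put("resourceName", "notes.txt"); // misleading name on purpose
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII));
        System.out.println(detector.detect(in, metadata)); // prints application/pdf
    }
}
```

In this sketch the composite only reads from the stream it is given, so a
caller that marked the stream beforehand can still reset it afterwards.
Copying a bounded prefix and handing each delegate its own view is just
one way to let several detectors look at the same leading bytes.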

