tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Tika use cases
Date Mon, 10 Sep 2007 18:37:34 GMT
Hi,

On 9/10/07, kbennett <kbennett@bbsinc.biz> wrote:
> Thanks for responding.  What you said made perfect sense.  My domain
> knowledge in this area is very limited, so I apologize in advance for that.

No need to apologize. I don't consider myself much of an expert
either, so feel free to dispute anything I say. :-)

> So a given parser (e.g. an MS Word document parser) might be instantiated at
> its first use with "global" options, that is, options for all parses, and
> then each call to extractMetadata would use that instance and be given
> file-specific options?  So it might look something like this?:

Exactly! An even more concrete example would be:

    // I want to extract metadata from a file I've been given
    File file = ...;

    // Construct a composite parser capable of parsing multiple document types
    CompositeParser composite = new CompositeParser();

    // Add support for MS Word documents
    WordParser word = new WordParser();
    word.setExtractDocumentProperties(false); // Nobody ever fills in these
    word.setExtractDeletedContent(true); // I want all the secrets!
    composite.addParser(word);

    // Fill in all the metadata we already know
    Metadata metadata = new Metadata();
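    // Note: "assert" is a reserved word in Java (since 1.4), so a real
    // API would need a different method name here
    metadata.assert("filename", file.getName(), Confidence.CERTAIN);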
    metadata.assert("content-length", file.length(), Confidence.CERTAIN);

    // Extract metadata from the given file
    InputStream stream = new FileInputStream(file);
    try {
        composite.extractMetadata(stream, metadata);
    } finally {
        stream.close();
    }

Note that we might well want to include some convenience code to
streamline common options, but the above reflects my understanding of
a truly generic mechanism.
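
To make the shape of the example above more tangible, here is a minimal
sketch of the composition and decoration it relies on. Everything in it
is an assumption made for illustration, not an actual Tika design: the
one-method Parser interface, the plain Map standing in for Metadata, the
two-argument addParser, the choice of delegate by a "content-type"
entry, and the CountingParser decorator.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser contract; a plain Map stands in for the Metadata class.
interface Parser {
    void extractMetadata(InputStream stream, Map<String, String> metadata) throws IOException;
}

// Composition: the composite is itself a Parser and delegates to a registered one.
// Picking the delegate by a "content-type" entry is an assumption of this sketch.
class CompositeParser implements Parser {
    private final Map<String, Parser> parsers = new HashMap<>();

    void addParser(String contentType, Parser parser) {
        parsers.put(contentType, parser);
    }

    @Override
    public void extractMetadata(InputStream stream, Map<String, String> metadata) throws IOException {
        String type = metadata.get("content-type");
        Parser delegate = parsers.get(type);
        if (delegate == null) {
            throw new IOException("No parser registered for " + type);
        }
        delegate.extractMetadata(stream, metadata);
    }
}

// Decoration: wraps any Parser and counts calls without touching the wrapped parser.
class CountingParser implements Parser {
    private final Parser wrapped;
    private int calls = 0;

    CountingParser(Parser wrapped) {
        this.wrapped = wrapped;
    }

    int getCalls() {
        return calls;
    }

    @Override
    public void extractMetadata(InputStream stream, Map<String, String> metadata) throws IOException {
        calls++;
        wrapped.extractMetadata(stream, metadata);
    }
}

class CompositionSketch {
    // Registers the given parser for MS Word and runs it through the composite.
    static Map<String, String> run(Parser wordParser) {
        CompositeParser composite = new CompositeParser();
        composite.addParser("application/msword", wordParser);

        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("content-type", "application/msword");

        // The caller opens the stream, so the caller closes it.
        try (InputStream stream = new ByteArrayInputStream(new byte[] {1, 2, 3})) {
            composite.extractMetadata(stream, metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Stand-in for a real Word parser: it only records that it ran.
        CountingParser word = new CountingParser((stream, metadata) -> metadata.put("parsed-by", "word"));
        System.out.println(run(word) + " after " + word.getCalls() + " call(s)");
    }
}
```

Because the composite and the decorator both implement the same
interface, client code never needs to know whether it holds a single
parser, a wrapped one, or a whole collection of them.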

Some specific notes on the above example code, especially on parts I
haven't discussed before:

1) The interfaces as currently envisioned should work seamlessly with
composition and decoration. I think "compatibility" with such patterns
is highly desirable.

2) I'd like to extend the current Metadata framework from Nutch with
support for multiple (potentially conflicting) sources of information
with various confidence levels. See the above code for an early
example. Support for things like the Shared MIME info database also
requires such "fuzzy" metadata.

3) After some consideration I think it's better if the parser
components consume but never close the given input streams. IMHO
(feel free to disagree) the responsibility of closing a stream should
always be on whoever opened the stream in the first place.
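
As a rough sketch of note 2), metadata that keeps several possibly
conflicting claims per key could look something like the following. The
names FuzzyMetadata, Claim, claim(), getClaims() and getBest() are made
up for illustration, as are the GUESS and LIKELY levels (only CERTAIN
appears in the example above). The method is called claim() rather than
assert() because "assert" is a reserved word in Java.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical confidence scale, ordered from least to most trusted.
enum Confidence { GUESS, LIKELY, CERTAIN }

// One claim about a metadata key, tagged with how much it is trusted.
final class Claim {
    final String value;
    final Confidence confidence;

    Claim(String value, Confidence confidence) {
        this.value = value;
        this.confidence = confidence;
    }
}

// Keeps every claim made about a key instead of overwriting earlier ones.
class FuzzyMetadata {
    private final Map<String, List<Claim>> claims = new HashMap<>();

    void claim(String name, String value, Confidence confidence) {
        claims.computeIfAbsent(name, k -> new ArrayList<>()).add(new Claim(value, confidence));
    }

    // All claims for a key, conflicting or not, in the order they were made.
    List<Claim> getClaims(String name) {
        return Collections.unmodifiableList(claims.getOrDefault(name, Collections.emptyList()));
    }

    // Value with the highest confidence, or null if nothing is known.
    // Ties go to the earliest claim.
    String getBest(String name) {
        Claim best = null;
        for (Claim c : getClaims(name)) {
            if (best == null || c.confidence.compareTo(best.confidence) > 0) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }
}

class ConfidenceSketch {
    public static void main(String[] args) {
        FuzzyMetadata metadata = new FuzzyMetadata();
        // Two sources disagree: a file name extension and a magic-byte check.
        metadata.claim("content-type", "text/plain", Confidence.GUESS);
        metadata.claim("content-type", "application/msword", Confidence.LIKELY);
        System.out.println(metadata.getBest("content-type"));
    }
}
```

Keeping the losing claims around, instead of discarding them, leaves
room for a later and more reliable source to change the outcome.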
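
The stream ownership rule in note 3) can be sketched as follows. The
consume() method is only a stand-in for a parser and CloseTrackingStream
is a helper invented for this illustration; neither is part of any
proposed API.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustration helper: remembers whether close() was ever called on it.
class CloseTrackingStream extends FilterInputStream {
    private boolean closed = false;

    CloseTrackingStream(InputStream in) {
        super(in);
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

class StreamOwnershipSketch {
    // Stand-in for a parser: reads the stream to the end but never closes it.
    static long consume(InputStream stream) throws IOException {
        long count = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = stream.read(buffer)) != -1) {
            count += n;
        }
        return count;
    }

    // Returns { stream still open right after parsing, stream closed at the end }.
    static boolean[] demo(byte[] data) {
        // The stream is opened here...
        CloseTrackingStream stream = new CloseTrackingStream(new ByteArrayInputStream(data));
        try {
            boolean openAfterParse;
            try {
                consume(stream);
                openAfterParse = !stream.isClosed();
            } finally {
                // ...so it is closed here, even if parsing fails.
                stream.close();
            }
            return new boolean[] { openAfterParse, stream.isClosed() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo(new byte[] {1, 2, 3});
        System.out.println("open after parse: " + result[0] + ", closed by caller: " + result[1]);
    }
}
```

One practical benefit of this split is that the caller stays free to
wrap, reuse, or inspect the stream after parsing, since no parser can
close it behind the caller's back.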

BR,

Jukka Zitting
