tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: [jira] Updated: (TIKA-26) Use Map<String, Content> instead of List<Content>
Date Sun, 23 Sep 2007 20:39:51 GMT

On 9/23/07, kbennett <kbennett@bbsinc.biz> wrote:
> 1) I suggest we create a class to store the parsed document content, rather
> than just a Map.  The class could have convenience methods such as
> getStringContent(), and possibly hold onto a resource identifier that could
> be set.  We might also want to make the parsed values immutable.

This is what I had in mind for the Metadata instance in my proposed
Parser interface design. I think I have a reasonable evolutionary path
designed for transforming the current Parser interfaces to this
proposed model. Something like this:

    current: List<Content> getContents();
    TIKA-26: Map<String,Content> getContents();
    TIKA-n1: Map<String,Content> parse(InputStream stream);
    TIKA-n2: String parse(InputStream stream, Map<String,Content> metadata);
    TIKA-n3: String parse(InputStream stream, Metadata metadata);
    TIKA-n4: void parse(InputStream stream, ContentHandler handler,
                        Metadata metadata);
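
As an illustration only, the final step (TIKA-n4) could be sketched in Java roughly as below. The Metadata class, the PlainTextParser, the TextCollector handler and the ParserDemo helper are hypothetical names made up for this sketch, not existing Tika code; only the parse signature comes from the list above. The point is that the parser keeps no per-document state: everything it learns goes into the handler and the Metadata object passed in by the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical metadata holder (an assumption for this sketch).
class Metadata {
    private final Map<String, String> values = new LinkedHashMap<>();

    public void set(String name, String value) { values.put(name, value); }

    public String get(String name) { return values.get(name); }
}

// The TIKA-n4 signature: the parser holds no document state of its own.
interface Parser {
    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException;
}

// Toy parser (hypothetical): treats the stream as UTF-8 plain text.
class PlainTextParser implements Parser {
    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException {
        metadata.set("Content-Type", "text/plain");
        // readAllBytes() needs Java 9+; it keeps this toy example short.
        char[] text = new String(stream.readAllBytes(), StandardCharsets.UTF_8).toCharArray();
        handler.startDocument();
        handler.characters(text, 0, text.length);
        handler.endDocument();
    }
}

// Handler that collects the character events into a String.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public String getText() { return text.toString(); }
}

// Small helper showing how a caller would drive the parser.
class ParserDemo {
    static String extract(Parser parser, String input, Metadata metadata) {
        TextCollector collector = new TextCollector();
        InputStream stream = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
        try {
            parser.parse(stream, collector, metadata);
        } catch (IOException | SAXException e) {
            // Wrapped as unchecked only to keep the demo call site short.
            throw new IllegalStateException(e);
        }
        return collector.getText();
    }
}
```

Because the caller owns both the handler and the Metadata instance, the same parser object can be reused for any number of documents.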

> 2) If we make the Parser stateless, how will we deal with the chunking of
> large documents?

By making the parse method output SAX events instead of a single
String that contains the text content of the entire document.
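
A minimal sketch of how SAX events address the chunking question, assuming a plain UTF-8 text source: the emitter reads the stream through a small fixed-size buffer and forwards each buffer as a characters() event, so neither side ever holds the whole document. StreamingTextEmitter and ChunkCounter are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Handler that only counts what it sees; it never stores the text.
class ChunkCounter extends DefaultHandler {
    int chunks = 0;
    long totalChars = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        chunks++;
        totalChars += length;
    }
}

class StreamingTextEmitter {
    // Emits the stream as a sequence of characters() events, one per buffer
    // read, so memory use is bounded by bufferSize rather than document size.
    static void emit(InputStream stream, ContentHandler handler, int bufferSize)
            throws IOException, SAXException {
        Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
        char[] buffer = new char[bufferSize];
        handler.startDocument();
        int n;
        while ((n = reader.read(buffer)) != -1) {
            handler.characters(buffer, 0, n);
        }
        handler.endDocument();
    }

    // Convenience wrapper for trying the emitter on an in-memory string.
    static ChunkCounter count(String text, int bufferSize) {
        ChunkCounter counter = new ChunkCounter();
        try {
            emit(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                 counter, bufferSize);
        } catch (IOException | SAXException e) {
            // Wrapped as unchecked only to keep the demo call site short.
            throw new IllegalStateException(e);
        }
        return counter;
    }
}
```

A handler that does need the full text can still append the chunks to a buffer itself; the difference is that this becomes the caller's choice instead of being forced by a String return value.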


Jukka Zitting
