tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: Metadata Discussion Status
Date Tue, 03 Aug 2010 10:00:46 GMT
Hi,

On Mon, Aug 2, 2010 at 10:36 PM, Paul Jakubik <paul@purediscovery.com> wrote:
> A while ago I added the http://wiki.apache.org/tika/MetadataDiscussion page
> to the Tika wiki.
>
> Since then, with the help of Jukka Zitting, a solution has been described
> for using the current Tika library to capture nested document metadata and
> associate that with the text extracted for each nested document.

Thanks for documenting this all on the wiki!

> What hasn't been accomplished is identifying a way to get to both the
> metadata and text for nested documents without the user writing a
> ContentHandler.
> [...]
> Are there any thoughts on how to move forward? Is it okay if users who want
> to extract nested documents with metadata resort to writing their own
> content handlers and parser decorators? Or would the Tika team prefer to
> offer an easier way for users to extract nested documents with metadata?

It would be great if you or someone else could come up with some nice
and clean utility classes for this.

PS. You wondered how to get the text content of a component
document. That's pretty simple; just extend my earlier example as follows:

   public void parse(
           InputStream stream, ContentHandler handler,
           Metadata metadata, ParseContext context)
           throws IOException, SAXException, TikaException {
       // Collect this component document's text in its own handler
       // instead of passing the events on to the outer handler.
       ContentHandler content = new BodyContentHandler();
       super.parse(stream, content, metadata, context);

       // Print the component's metadata, then its extracted text.
       System.out.println("----");
       System.out.println(metadata);
       System.out.println("----");
       System.out.println(content.toString());
   }
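
For anyone curious what a hand-written text collector involves, here is a
minimal sketch using only the SAX classes that ship with the JDK. The class
name TextCollector and its collect() helper are hypothetical and not part of
Tika; unlike BodyContentHandler, this sketch does not limit itself to the
body element, it simply appends all character data it sees.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): gathers character data from SAX events.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver text in several chunks; append each one.
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Runs a small XML string through the JDK SAX parser and returns the collected text.
    static String collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<html><body><p>Hello</p><p> world</p></body></html>"));
    }
}
```

In the parse() method above, an instance of such a class could presumably
stand in for BodyContentHandler, since both implement ContentHandler.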

PPS. I'm currently writing a chapter about this technique and other
ways to use Tika parsers in our Tika In Action book [1]. Chapter
five should become available on the Manning early access program
within a month or two. We'd love to see comments on the existing
chapters and topics to be covered in future chapters. The book forum
is at [2].

[1] http://www.manning.com/mattmann/
[2] http://www.manning-sandbox.com/forum.jspa?forumID=678

BR,

Jukka Zitting
