tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: XHTML Bean and corresponding content handler
Date Sat, 08 Aug 2009 17:41:33 GMT

On Fri, Aug 7, 2009 at 2:53 PM, Michael Wechner
<michael.wechner@wyona.com> wrote:
> I did some more debugging and it really seems to me that the
> XHTMLContentHandler does not add meta content to the head of
> the XHTML and hence when using the WriteOutContentHandler one does not
> receive this meta content, but one has to make
> sure to retrieve the meta content separately in order to make a "full text"
> index.
> Is this a feature or a bug or do I misunderstand something?

It's a feature. The title is included in the <head/> section just to
make the resulting XHTML validate and so far we haven't had people
needing more metadata in there. The Parser interface is designed to
return document metadata in the Metadata object and the structured
text content through the given ContentHandler. Is there a good use
case for also exposing the metadata in the <head/> section of the
XHTML stream?

> Also it seems to me that the WriteOutContentHandler concatenates title and
> body which means the last word of the title and the first word of the body
> are "merged" and hence are probably not indexed correctly at some later
> stage.

I would suggest using BodyContentHandler instead of
WriteOutContentHandler. You can use it just like
WriteOutContentHandler, but it only outputs the contents of the
<body/> section. See the --text option in TikaCLI or the ParsingReader
class for good examples.
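
To illustrate the difference, here is a minimal sketch that uses only the
JDK's SAX classes. It is not Tika's implementation: BodyTextSketch,
TextCollector and extract are hypothetical names made up for this example,
and the handler only approximates what a body-only handler does. With
bodyOnly set to false it appends every character event, so the title and
the first body word run together; with bodyOnly set to true it keeps only
the text inside <body/>.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is NOT how Tika implements its handlers.
class BodyTextSketch {

    /** Collects character events, optionally only those inside <body>. */
    static class TextCollector extends DefaultHandler {
        private final boolean bodyOnly;
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // greater than 0 while inside <body>

        TextCollector(boolean bodyOnly) {
            this.bodyOnly = bodyOnly;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep everything, or only the text seen inside <body>.
            if (!bodyOnly || bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Parses the given XHTML string and returns the collected text. */
    static String extract(String xhtml, boolean bodyOnly) throws Exception {
        TextCollector collector = new TextCollector(bodyOnly);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Annual Report</title></head>"
                + "<body><p>Revenue grew.</p></body></html>";
        System.out.println(extract(xhtml, false)); // Annual ReportRevenue grew.
        System.out.println(extract(xhtml, true));  // Revenue grew.
    }
}
```

The first line of output shows the merged "ReportRevenue" token described
in the quoted question; the second contains the body text only, which is
the behaviour suggested above for full-text indexing.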


Jukka Zitting
