tika-dev mailing list archives

From Michael Wechner <michael.wech...@wyona.com>
Subject Re: XHTML Bean and corresponding content handler
Date Sat, 08 Aug 2009 19:55:40 GMT
Jukka Zitting wrote:
> Hi,
> On Fri, Aug 7, 2009 at 2:53 PM, Michael
> Wechner<michael.wechner@wyona.com> wrote:
>> I did some more debugging and it really seems to me that the
>> XHTMLContentHandler does not add meta content to the head of
>> the XHTML and hence when using the WriteOutContentHandler one does not
>> receive this meta content, but one has to make
>> sure to retrieve the meta content separately in order to make a "full text"
>> index.
>> Is this a feature or a bug or do I misunderstand something?
> It's a feature. The title is included in the <head/> section just to
> make the resulting XHTML validate

OK, thanks for pointing this out. I think it would be good to add a note
about this somewhere in the code. Or does such a note already exist and I
just missed it?
>  and so far we haven't had people
> needing more metadata in there. The Parser interface is designed to
> return document metadata in the Metadata object and the structured
> text content through the given ContentHandler. Is there a good use
> case for why the metadata should be exposed also in the <head/>
> section of the XHTML stream?

As I mentioned earlier, when using the WriteOutContentHandler one wouldn't
have to extract the metadata explicitly.
>> Also it seems to me that the WriteOutContentHandler concatenates title and
>> body which means the last word of the title and the first word of the body
>> are "merged" and hence are probably not indexed correctly at some later
>> stage.
> I would suggest using BodyContentHandler instead of
> WriteOutContentHandler. You can use it just like
> WriteOutContentHandler, but it only outputs the contents of the
> <body/> section. See the --text option in TikaCLI or the ParsingReader
> class for good examples.
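The word-merging effect raised in the quoted question can be reproduced with a plain SAX handler. The sketch below uses only JDK classes, not Tika's own handlers: `TextCollector` and its `separateElements` flag are hypothetical names, and the sample XHTML is made up. It shows how a handler that only appends character events glues the title to the body, and how emitting a space at each element end avoids that (at the cost of also splitting text at inline elements).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: collects the text of an XHTML stream.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final boolean separateElements;

    TextCollector(boolean separateElements) {
        this.separateElements = separateElements;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append character data exactly as it arrives.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Optionally add one space after each element so adjacent
        // elements (e.g. <title> and <p>) do not run together.
        if (separateElements && text.length() > 0
                && !Character.isWhitespace(text.charAt(text.length() - 1))) {
            text.append(' ');
        }
    }

    static String extract(String xhtml, boolean separateElements) {
        try {
            TextCollector handler = new TextCollector(separateElements);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Annual Report</title></head>"
                + "<body><p>Revenue grew.</p></body></html>";
        System.out.println(extract(xhtml, false)); // title and body merged
        System.out.println(extract(xhtml, true));  // title and body separated
    }
}
```

With the sample input, the first call yields "Annual ReportRevenue grew." and the second "Annual Report Revenue grew.", which is the difference that matters for a tokenizer at indexing time.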

Yes, I have seen the BodyContentHandler, but using it means I have to
explicitly concatenate the title (and the other metadata) myself. That is
not much effort, but as I said, I think it defeats the purpose of the
WriteOutContentHandler ;-)
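The explicit concatenation mentioned above might look roughly like the following. This is a sketch under assumptions: `FullTextBuilder` is a hypothetical helper, and a plain `Map` stands in for the values one would read from Tika's `Metadata` object, while `bodyText` stands in for the string obtained from a `BodyContentHandler`. Each metadata value goes on its own line so that it cannot merge with the first word of the body.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: builds one "full text" string
// from separately extracted metadata values and body text.
class FullTextBuilder {
    static String fullText(Map<String, String> metadata, String bodyText) {
        StringBuilder sb = new StringBuilder();
        for (String value : metadata.values()) {
            // Skip missing values; end each value with a newline separator.
            if (value != null && !value.isEmpty()) {
                sb.append(value).append('\n');
            }
        }
        sb.append(bodyText);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample values for illustration only.
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        metadata.put("title", "Annual Report");
        metadata.put("author", "Jane Doe");
        System.out.println(fullText(metadata, "Revenue grew."));
    }
}
```

Whether every metadata field belongs in the full-text index is a design choice; filtering the map to a few fields such as the title before calling `fullText` keeps noise out of the index.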

Thanks for your explanations

> BR,
> Jukka Zitting
