tika-dev mailing list archives

From Michael Wechner <michael.wech...@wyona.com>
Subject Re: XHTML Bean and corresponding content handler
Date Sat, 08 Aug 2009 19:55:40 GMT
Jukka Zitting wrote:
> Hi,
>
> On Fri, Aug 7, 2009 at 2:53 PM, Michael
> Wechner<michael.wechner@wyona.com> wrote:
>   
>> I did some more debugging and it really seems to me that the
>> XHTMLContentHandler does not add meta content to the head of
>> the XHTML and hence when using the WriteOutContentHandler one does not
>> receive this meta content, but one has to make
>> sure to retrieve the meta content separately in order to make a "full text"
>> index.
>>
>> Is this a feature or a bug or do I misunderstand something?
>>     
>
> It's a feature. The title is included in the <head/> section just to
> make the resulting XHTML validate

OK, thanks for pointing this out. I think it would be good to add a note
about this somewhere in the code. Or does one already exist and I
just missed it?
>  and so far we haven't had people
> needing more metadata in there. The Parser interface is designed to
> return document metadata in the Metadata object and the structured
> text content through the given ContentHandler. Is there a good use
> case for why the metadata should be exposed also in the <head/>
> section of the XHTML stream?
>   

As I mentioned below, if the metadata were included in the <head/>, one
would not have to extract it explicitly when using the
WriteOutContentHandler.
>   
>> Also it seems to me that the WriteOutContentHandler concatenates title and
>> body which means the last word of the title and the first word of the body
>> are "merged" and hence are probably not indexed correctly at some later
>> stage.
>>     
>
> I would suggest using BodyContentHandler instead of
> WriteOutContentHandler. You can use it just like
> WriteOutContentHandler, but it only outputs the contents of the
> <body/> section. See the --text option in TikaCLI or the ParsingReader
> class for good examples.
>   

Yes, I have seen the BodyContentHandler, but it means I have to
explicitly concatenate the title (and the other metadata) myself. That
is not much effort, but as I said, I think it defeats the purpose of the
WriteOutContentHandler ;-)
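
To illustrate the merging problem I described, here is a minimal sketch
that uses only the plain JDK SAX API. It is not Tika's
WriteOutContentHandler; the class names FullTextSketch and TextCollector
are made up for this example. It shows how a handler that simply appends
all character data glues the title to the body, and how writing a space
at each element end (one possible workaround, an assumption on my side)
keeps the words apart:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; none of these names exist in Tika.
public class FullTextSketch {

    /** Collects all character data, optionally separating elements with a space. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean separate;

        TextCollector(boolean separate) {
            this.separate = separate;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Append one space after an element unless the text already ends with one.
            if (separate && text.length() > 0
                    && text.charAt(text.length() - 1) != ' ') {
                text.append(' ');
            }
        }

        String text() {
            return text.toString().trim();
        }
    }

    /** Parses the given XHTML string and returns the collected text. */
    public static String extract(String xhtml, boolean separate) throws Exception {
        TextCollector collector = new TextCollector(separate);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Annual Report</title></head>"
                + "<body><p>Revenue grew.</p></body></html>";
        System.out.println(extract(xhtml, false)); // Annual ReportRevenue grew.
        System.out.println(extract(xhtml, true));  // Annual Report Revenue grew.
    }
}
```

In the first case a tokenizer would see "ReportRevenue" as a single
term, which is the indexing problem I mean.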

Thanks for your explanations

Michael
> BR,
>
> Jukka Zitting
>   

