tika-dev mailing list archives

From Michael Wechner <michael.wech...@wyona.com>
Subject Re: XHTML Bean and corresponding content handler
Date Fri, 07 Aug 2009 12:53:17 GMT
Hi

I did some more debugging, and it really seems that the
XHTMLContentHandler does not add the meta content to the head of the
XHTML. Hence, when using the WriteOutContentHandler, one does not
receive this meta content and has to retrieve it separately in order
to build a "full text" index.

Is this a feature or a bug or do I misunderstand something?

Also, it seems that the WriteOutContentHandler concatenates title and
body without any separator, which means the last word of the title and
the first word of the body are "merged" and hence are probably not
indexed correctly at some later stage. To illustrate what I mean, here
is an example:

<head><title>My last title word</title></head><body><p>My first body word</p></body>

and the output of writer.toString(), with writer passed to
WriteOutContentHandler(writer), will be

My last title wordMy first body word

and hence "wordMy" will be indexed "badly".
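
A minimal sketch of the effect, using only the JDK's SAX classes rather
than Tika itself. TextOnlyDemo and TextCollector are hypothetical names
and are not Tika's WriteOutContentHandler; the sketch only assumes that
a text-only handler appends character events and ignores element
boundaries. The separateElements flag shows one possible workaround
(appending a space when an element ends), which is an assumption of
this sketch and not something Tika is known to do.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a text-only content handler. This is NOT Tika's
// WriteOutContentHandler, just plain JDK SAX illustrating the same effect.
class TextOnlyDemo {

    /** Collects character events; optionally appends a space when an element ends. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final boolean separateElements;

        TextCollector(boolean separateElements) {
            this.separateElements = separateElements;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Character data is appended as-is; element boundaries are ignored here.
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Workaround (assumption of this sketch): separate the text of
            // adjacent elements with a single space.
            if (separateElements && out.length() > 0
                    && !Character.isWhitespace(out.charAt(out.length() - 1))) {
                out.append(' ');
            }
        }

        String text() {
            return out.toString().trim();
        }
    }

    /** Parses the given XML string and returns the collected text. */
    static String extractText(String xml, boolean separateElements) {
        try {
            TextCollector handler = new TextCollector(separateElements);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>My last title word</title></head>"
                + "<body><p>My first body word</p></body></html>";
        System.out.println(extractText(xhtml, false)); // title and body run together
        System.out.println(extractText(xhtml, true));  // title and body separated
    }
}
```

With separateElements set to false this prints
"My last title wordMy first body word"; with true it prints
"My last title word My first body word".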

Can anyone reproduce this?

Thanks

Michael

Michael Wechner wrote:
> Jukka Zitting wrote:
>> Hi,
>>
>> On Tue, Aug 4, 2009 at 9:30 AM, Michael
>> Wechner<michael.wechner@wyona.com> wrote:
>>  
>>> String XHTMLBean.getHead().getMeta(XHTMLBean.DESCRIPTION)
>>> String XHTMLBean.getHead().getTitle()
>>>     
>>
>> These you can get from the Metadata object.
>>   
>
> ok, I think I finally understood this, although I find it a bit 
> "confusing" that one seems to set /html/head/title with
>
> metadata.set(metadata.TITLE, "some title");
>
> and to set /html/head/meta with for example
>
> metadata.set(metadata.KEYWORDS, "some keywords");
>
> The title does seem to be added when startDocument() is called, but 
> the <meta name="keywords" content="..."/> element, for example, seems 
> not to be added.
>
> Maybe I still misunderstand something, though.
>>  
>>> String[] XHTMLBean.getBody().getParagraphs();
>>>     
>>
>> This is a bit troublesome as not all parsers produce paragraphs of
>> content. For example the Excel parser produces XHTML tables.
>>   
>
> ok
>> You can either get just the plain character stream using tools like
>> BodyContentHandler, or the full XHTML output as SAX events (which you
>> can serialize to a byte stream if you want). I'm not sure if there's
>> any reasonable intermediate content abstraction.
>>   
>
> the reason I am looking for this is that various search engines seem 
> to use the following order when building the result excerpt:
>
> - <meta name="description" ...
> - first paragraph within body tag
> - ???
>
> Thanks
>
> Michael
>> BR,
>>
>> Jukka Zitting
>>   
>

