tika-dev mailing list archives

From Michael Wechner <michael.wech...@wyona.com>
Subject Re: XHTML Bean and corresponding content handler
Date Thu, 06 Aug 2009 23:01:53 GMT
Jukka Zitting wrote:
> Hi,
>
> On Tue, Aug 4, 2009 at 9:30 AM, Michael
> Wechner<michael.wechner@wyona.com> wrote:
>   
>> String XHTMLBean.getHead().getMeta(XHTMLBean.DESCRIPTION)
>> String XHTMLBean.getHead().getTitle()
>>     
>
> These you can get from the Metadata object.
>   

OK, I think I finally understand this, although I find it a bit
confusing that one sets /html/head/title with

metadata.set(metadata.TITLE, "some title");

and /html/head/meta with, for example,

metadata.set(metadata.KEYWORDS, "some keywords");

The title does seem to be added when startDocument() is called, but
the <meta name="keywords" content="..."/> element, for example, does
not seem to be added.

Maybe I am still misunderstanding something, though.
>   
>> String[] XHTMLBean.getBody().getParagraphs();
>>     
>
> This is a bit troublesome as not all parsers produce paragraphs of
> content. For example the Excel parser produces XHTML tables.
>   

ok
> You can either get just the plain character stream using tools like
> BodyContentHandler, or the full XHTML output as SAX events (which you
> can serialize to a byte stream if you want). I'm not sure if there's
> any reasonable intermediate content abstraction.
>   
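
For illustration, the "serialize the SAX events to a byte stream" option mentioned above could look roughly like the sketch below. It uses only the JDK's own TransformerHandler, not any Tika class. The names SaxToBytes, serializerFor and emitSample are made up for this example, and emitSample merely stands in for a parser that would normally produce the events.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

public class SaxToBytes {

    /** Returns a handler that writes every SAX event it receives to out as XML. */
    public static TransformerHandler serializerFor(ByteArrayOutputStream out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    /** Hypothetical stand-in for a parser: emits a tiny document as SAX events. */
    public static void emitSample(ContentHandler handler) throws Exception {
        AttributesImpl none = new AttributesImpl();
        char[] text = "Hello".toCharArray();
        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "body", "body", none);
        handler.startElement("", "p", "p", none);
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
    }

    /** Convenience wrapper: runs the sample events through the serializer. */
    public static String sampleAsString() throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        emitSample(serializerFor(out));
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sampleAsString());
    }
}
```

In real use the handler returned by serializerFor would presumably be passed to the parser in place of emitSample; that wiring is an assumption and is not shown here.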

The reason I am asking is that various search engines seem to pick
the result excerpt in the following order:

- <meta name="description" ...
- first paragraph within body tag
- ???
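
As a rough sketch of that order, a SAX handler could remember the meta description and the first paragraph inside the body while the XHTML events stream past, and then prefer the description. This uses only JDK classes; ExcerptPicker and excerptOf are hypothetical names, and since the third fallback is unknown the sketch simply returns null when neither source is present.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ExcerptPicker extends DefaultHandler {

    private String description;
    private final StringBuilder firstParagraph = new StringBuilder();
    private boolean inBody;
    private boolean inParagraph;
    private boolean paragraphDone;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("meta".equals(qName) && "description".equalsIgnoreCase(atts.getValue("name"))) {
            description = atts.getValue("content");
        } else if ("body".equals(qName)) {
            inBody = true;
        } else if ("p".equals(qName) && inBody && !paragraphDone) {
            inParagraph = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName) && inParagraph) {
            // Only the first paragraph is of interest; ignore later ones.
            inParagraph = false;
            paragraphDone = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inParagraph) {
            firstParagraph.append(ch, start, length);
        }
    }

    /** Meta description first, then the first body paragraph, else null. */
    public String excerpt() {
        if (description != null && !description.trim().isEmpty()) {
            return description.trim();
        }
        String paragraph = firstParagraph.toString().trim();
        return paragraph.isEmpty() ? null : paragraph;
    }

    /** Parses a well-formed XHTML string (no DOCTYPE assumed) and picks an excerpt. */
    public static String excerptOf(String xhtml) throws Exception {
        ExcerptPicker picker = new ExcerptPicker();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), picker);
        return picker.excerpt();
    }
}
```

Because the handler only relies on SAX events, the same class could presumably be fed by any parser that emits XHTML as SAX events, although documents that produce tables instead of paragraphs would fall through to null.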

Thanks

Michael
> BR,
>
> Jukka Zitting
>   

