tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: HTML styles and <li> tags are ignored
Date Mon, 04 Jun 2012 12:36:15 GMT

On Mon, Jun 4, 2012 at 2:21 PM, andrewtr <andrew.tr@compvue.com> wrote:
> When I parse a PDF or Word document using AutoDetectParser, the <li> and
> <ul> tags are converted to <p> tags. I need the exact HTML content that is
> in the PDF or Word document.

<li> and <ul> tags in PDF or Word? I assume you mean the native list
formatting of those document types instead?

The Tika parsers for PDF and Office documents could/should
automatically map such formatting to equivalent XHTML constructs, but
I don't think they currently do. You'll need to look into the source
code to see how to make that happen.
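
Until the parsers do that mapping themselves, one possible workaround is
to post-process the SAX events on the way out, since Tika parsers report
their XHTML output to a ContentHandler. The sketch below is only an
illustration under stated assumptions, not how Tika does or would
implement it: BulletListFilter and its rewrite() helper are hypothetical
names, and it assumes list items arrive as <p> elements whose first text
chunk starts with a bullet character (U+2022), which will not hold for
every document. It uses only the JDK's SAX classes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: rewrites runs of bullet-prefixed <p> as <ul><li>...</li></ul>. */
class BulletListFilter extends XMLFilterImpl {
    private static final char BULLET = '\u2022'; // assumed list marker; real documents vary

    private boolean pendingP;   // saw <p> but have not decided how to emit it yet
    private boolean inList;     // a <ul> is currently open
    private boolean inItem;     // the current <p> was emitted as <li>
    private String pUri = "";   // namespace URI of the most recent <p>
    private Attributes pAtts = new AttributesImpl();

    private static boolean isP(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    /** Closes the open <ul> unless we are still inside one of its items. */
    private void closeListIfIdle() throws SAXException {
        if (inList && !inItem) {
            inList = false;
            super.endElement(pUri, "ul", "ul");
        }
    }

    /** Emits a deferred <p> as an ordinary paragraph. */
    private void flushAsParagraph() throws SAXException {
        if (pendingP) {
            pendingP = false;
            closeListIfIdle();
            super.startElement(pUri, "p", "p", pAtts);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        flushAsParagraph();
        if (isP(localName, qName)) {
            // Defer the decision until the first text chunk arrives.
            pendingP = true;
            pUri = uri;
            pAtts = new AttributesImpl(atts);
            return;
        }
        closeListIfIdle();
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (pendingP) {
            int i = start;
            int end = start + length;
            while (i < end && Character.isWhitespace(ch[i])) i++;
            if (i < end && ch[i] == BULLET) {
                pendingP = false;
                if (!inList) {
                    inList = true;
                    super.startElement(pUri, "ul", "ul", new AttributesImpl());
                }
                super.startElement(pUri, "li", "li", pAtts);
                inItem = true;
                i++; // drop the bullet and the whitespace after it
                while (i < end && Character.isWhitespace(ch[i])) i++;
                super.characters(ch, i, end - i);
                return;
            }
            flushAsParagraph(); // only the first chunk is inspected
        }
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        flushAsParagraph(); // covers an empty <p/>
        if (isP(localName, qName) && inItem) {
            inItem = false;
            super.endElement(uri, "li", "li");
            return;
        }
        closeListIfIdle();
        super.endElement(uri, localName, qName);
    }

    /** Demo helper: runs an XML string through the filter and returns a tag/text trace. */
    static String rewrite(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        BulletListFilter filter = new BulletListFilter();
        filter.setContentHandler(new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
            @Override public void characters(char[] c, int s, int n) {
                out.append(c, s, n);
            }
        });
        filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

With Tika you would presumably call setContentHandler() on the filter to
point it at your real output handler and then pass the filter to the
parser's parse() call in place of that handler. Doing the mapping inside
the PDF and Office parsers would be more reliable, since they have access
to the native list structure and do not have to guess from bullet
characters.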


Jukka Zitting
