tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Re: EpubContentParser for xhtml
Date Thu, 12 May 2011 10:53:04 GMT
On Wed, 11 May 2011, Alberto Barranco Ramón wrote:
> We are looking for an .epub parser. We considered Tika at the beginning, but
> while testing we realized that Tika doesn't currently parse .xhtml files. At
> the moment only .html files are parsed. I saw a TODO in the source code of
> EpubContentParser.java that says:
> /**
> * Parser for EPUB OPS <code>*.html</code> files.
> *
> * For the time being, assume XHTML (TODO: DTBook)
> */

To me, that comment says Tika handles only XHTML. (The important thing
isn't the file extension, but what's in the file.)

What happens when you try giving the parser one of your xhtml epub files?
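
To illustrate the point about content versus file extension, here is a rough
sketch using only the JDK's SAX parser. It is not Tika code and not how Tika
decides anything; the class name XhtmlSniffer and the method looksLikeXhtml are
made up for this example. It treats a document as XHTML if it is well-formed
XML whose root element is named "html", whatever the file is called.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: checks content, not the file name.
public class XhtmlSniffer {

    /** True if the bytes are well-formed XML with a root element named "html". */
    public static boolean looksLikeXhtml(byte[] content) {
        final String[] root = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Remember only the first (root) element.
                if (root[0] == null) {
                    root[0] = localName.isEmpty() ? qName : localName;
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Hand back an empty source so no external DTD is fetched.
                return new InputSource(new ByteArrayInputStream(new byte[0]));
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new ByteArrayInputStream(content), handler);
        } catch (Exception e) {
            // Not well-formed XML (for example, tag-soup HTML), so not XHTML.
            return false;
        }
        return "html".equals(root[0]);
    }

    public static void main(String[] args) {
        // Well-formed XHTML that could just as well be stored as "chapter1.html".
        byte[] xhtml = ("<?xml version=\"1.0\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body><p>hi</p></body></html>")
                .getBytes(StandardCharsets.UTF_8);
        // Tag soup with an unclosed <br>, which an XML parser rejects.
        byte[] soup = "<html><body><p>text<br></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("xhtml content: " + looksLikeXhtml(xhtml));
        System.out.println("tag-soup content: " + looksLikeXhtml(soup));
    }
}
```

The check never looks at a file name, so an OPS document named *.html and one
named *.xhtml are treated the same as long as the content is XHTML.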
