tika-dev mailing list archives

From abc <imrank...@gmail.com>
Subject How to Convert Doc or Docx File to HTML?
Date Sun, 29 Jan 2012 11:24:54 GMT
I need to convert doc/docx files to HTML. I was able to convert doc to HTML
using Apache POI, but I am unable to convert docx to HTML. Someone suggested
that I use the XWPFWordExtractorDecorator class, which converts docx to HTML.
I was able to reuse XWPFWordExtractorDecorator, but it only gives me plain
text. How do I get the HTML? Here is what I have done so far:


public class XWPFWordExtractorDecoratorChild extends XWPFWordExtractorDecorator {
    public XWPFWordExtractorDecoratorChild(ParseContext context, XWPFWordExtractor extractor) {
        super(context, extractor);
    }

    // Public wrapper around the inherited buildXHTML(...) method.
    public void buildHTML(XHTMLContentHandler xhtml)
            throws SAXException, XmlException, IOException {
        this.buildXHTML(xhtml);
    }
}


ParseContext p = new ParseContext();
XWPFDocument doc = new XWPFDocument(stream);
XWPFWordExtractor ex = new XWPFWordExtractor(doc);
XWPFWordExtractorDecoratorChild dec = new XWPFWordExtractorDecoratorChild(p, ex);
StringWriter writer = new StringWriter();
Metadata meta = new Metadata();
XHTMLContentHandler h = new XHTMLContentHandler(new BodyContentHandler(writer), meta);
dec.buildHTML(h);
String s = writer.toString();

Any help converting doc/docx to HTML with styles preserved is appreciated.
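
A possible reason for the plain-text output, offered as a sketch rather than a confirmed fix: BodyContentHandler(writer) writes only the character events of the body to the writer, so the element events that buildXHTML emits never appear as tags. The JDK-only example below feeds the same SAX events to a text-only handler and to a TransformerHandler configured with the "html" output method. The class SaxToHtmlSketch, its method names, and the sample events are hypothetical stand-ins; the assumption is that the handler returned by htmlHandler(...) could be passed to XHTMLContentHandler in place of new BodyContentHandler(writer), which has not been tested against Tika here.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class (not part of Tika or POI); uses only JDK APIs.
class SaxToHtmlSketch {

    // A ContentHandler that serializes every SAX event it receives as HTML markup.
    static TransformerHandler htmlHandler(StringWriter out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // Output properties are set before the result is attached.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // A ContentHandler that keeps character data and ignores element events.
    static ContentHandler textOnlyHandler(final StringWriter out) {
        return new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.write(ch, start, length);
            }
        };
    }

    // Sends a string to the handler as a characters(...) event.
    static void text(ContentHandler h, String s) throws Exception {
        char[] c = s.toCharArray();
        h.characters(c, 0, c.length);
    }

    // Stand-in for the events a real extractor would emit for one paragraph.
    static void emitSampleEvents(ContentHandler h) throws Exception {
        AttributesImpl none = new AttributesImpl();
        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "body", "body", none);
        h.startElement("", "p", "p", none);
        text(h, "Hello ");
        h.startElement("", "b", "b", none);
        text(h, "world");
        h.endElement("", "b", "b");
        h.endElement("", "p", "p");
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    static String renderHtml() throws Exception {
        StringWriter out = new StringWriter();
        emitSampleEvents(htmlHandler(out));
        return out.toString();
    }

    static String renderTextOnly() throws Exception {
        StringWriter out = new StringWriter();
        emitSampleEvents(textOnlyHandler(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("text only: " + renderTextOnly());
        System.out.println("html:      " + renderHtml());
    }
}
```

Even if this swap works, the markup would contain only the elements the extractor emits; whether Word styles (fonts, colors, spacing) are carried over depends on the extractor and is not addressed by this sketch.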

