tika-dev mailing list archives

From "Joseph Vychtrle (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-690) WordExtractor doesn't extract text from HWPFDocument
Date Sun, 14 Aug 2011 13:19:27 GMT
WordExtractor doesn't extract text from HWPFDocument

                 Key: TIKA-690
                 URL: https://issues.apache.org/jira/browse/TIKA-690
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9, 1.0
            Reporter: Joseph Vychtrle

If I use Apache POI's HWPF component to create an MS Word .doc file and pass it to
tika.parseToString(is), it returns just a newline ("\n"). I tested this with many different
input texts. Adding paragraphs doesn't help.

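private void createDOCDocument(String from, File file) throws Exception {

    // Load the template from the classpath (DOCGenerator.class, not DOCGenerator.class.getClass())
    POIFSFileSystem fs = new POIFSFileSystem(DOCGenerator.class.getResourceAsStream("/poi/template.doc"));
    HWPFDocument doc = new HWPFDocument(fs);

    // Insert the text at the start of the document's main range
    Range range = doc.getRange();
    CharacterRun run1 = range.insertBefore(from);

    DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
    CustomProperties cp = dsi.getCustomProperties();
    if (cp == null)
        cp = new CustomProperties();
    cp.put("myProperty", "foo bar baz");
    // Write the modified properties back to the summary information
    dsi.setCustomProperties(cp);

    // Close the stream so the file is fully flushed before it is parsed
    FileOutputStream out = new FileOutputStream(file);
    try {
        doc.write(out);
    } finally {
        out.close();
    }
}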
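
Before involving Tika at all, it may help to confirm that the file written by
createDOCDocument at least starts with the OLE2 compound-document signature
(D0 CF 11 E0 A1 B1 1A E1). The following is only a sketch using the JDK; the
class and method names are hypothetical and are not part of POI or Tika.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper (not part of POI or Tika): checks the OLE2 file signature.
final class Ole2SignatureCheck {

    // First eight bytes of every OLE2 compound document (.doc, .xls, .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the file begins with the OLE2 signature.
    static boolean hasOle2Signature(File file) throws IOException {
        if (file.length() < OLE2_MAGIC.length) {
            return false; // too short to hold the signature
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(head);
        }
        return Arrays.equals(head, OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + ": " + hasOle2Signature(new File(path)));
        }
    }
}
```

A true result only shows that the container looks like OLE2; it says nothing
about whether the text pieces inside the Word stream are readable.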

protected String extractText(InputStream is) throws SystemException {
	Tika tika = new Tika();
	tika.setMaxStringLength(new Long(maxCharCount).intValue());
	String text;
	try {
		text = tika.parseToString(is);
	} catch (IOException ioe) {
		throw new SystemException(ioe.getMessage(), ioe);
	} catch (TikaException te) {
		throw new SystemException(te.getMessage(), te);
	}
	return text;
}
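
Since the symptom is a result consisting only of "\n", a caller could treat
whitespace-only output as a failed extraction instead of passing it on. This is
a minimal sketch using only the JDK; ExtractionCheck and isBlankResult are
hypothetical names, not Tika API.

```java
// Hypothetical helper (not part of Tika): flags extraction results with no real text.
final class ExtractionCheck {

    // Returns true when the extracted text is null, empty, or whitespace only,
    // which covers the "\n" result described in this report.
    static boolean isBlankResult(String text) {
        return text == null || text.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isBlankResult("\n"));          // whitespace only
        System.out.println(isBlankResult("some text\n")); // contains real text
    }
}
```

In extractText, such a check could run right after tika.parseToString(is), for
example to log a warning when the result is blank.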


This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

