tika-dev mailing list archives

From "Joseph Vychtrle (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (TIKA-690) WordExtractor doesn't extract text from HWPFDocument
Date Sun, 14 Aug 2011 18:30:32 GMT

     [ https://issues.apache.org/jira/browse/TIKA-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Vychtrle closed TIKA-690.
--------------------------------

    Resolution: Not A Problem

WordExtractor requires an HWPF document with paragraphs.

> WordExtractor doesn't extract text from HWPFDocument
> ----------------------------------------------------
>
>                 Key: TIKA-690
>                 URL: https://issues.apache.org/jira/browse/TIKA-690
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9, 1.0
>            Reporter: Joseph Vychtrle
>              Labels: parsing
>
> If I use Apache POI's HWPF component to create an MS Word .doc file and pass it to tika.parseToString(is),
> it returns just a newline ("\n"). I tested this with many different input texts. Adding
> paragraphs doesn't help.
> {code}
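> private void createDOCDocument(String from, File file) throws Exception {
>     // Resolve the template relative to DOCGenerator's own class loader.
>     POIFSFileSystem fs = new POIFSFileSystem(DOCGenerator.class.getResourceAsStream("/poi/template.doc"));
>     HWPFDocument doc = new HWPFDocument(fs);
>     // Insert the text at the start of the document range.
>     Range range = doc.getRange();
>     CharacterRun run1 = range.insertBefore(from);
>     run1.setFontSize(11);
>     // Attach a custom property, creating the container if the template has none.
>     DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
>     CustomProperties cp = dsi.getCustomProperties();
>     if (cp == null) {
>         cp = new CustomProperties();
>     }
>     cp.put("myProperty", "foo bar baz");
>     dsi.setCustomProperties(cp);
>     // Close the output stream even if write() fails.
>     FileOutputStream out = new FileOutputStream(file);
>     try {
>         doc.write(out);
>     } finally {
>         out.close();
>     }
> }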
> {code}
> {code}
> protected String extractText(InputStream is) throws SystemException {
>     Tika tika = new Tika();
>     // Cap the length of the extracted text; maxCharCount is defined outside this snippet.
>     tika.setMaxStringLength(Long.valueOf(maxCharCount).intValue());
>     String text;
>     try {
>         text = tika.parseToString(is);
>     } catch (IOException ioe) {
>         throw new SystemException(ioe.getMessage(), ioe);
>     } catch (TikaException te) {
>         throw new SystemException(te.getMessage(), te);
>     }
>     return text;
> }
> {code}
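
A side note on loading a classpath template such as /poi/template.doc: looking a resource up through a class literal and looking it up through getClass() called on that literal are not equivalent. The second form resolves against java.lang.Class itself, whose class loader is the bootstrap loader, so the lookup may miss resources that belong to the application. The sketch below only illustrates that difference; ResourceLookupDemo and its method names are hypothetical and are not part of Tika, POI, or the reporter's code.

```java
import java.io.InputStream;

// Hypothetical demo class, used only to illustrate classpath resource lookup.
final class ResourceLookupDemo {

    // getClass() on a class literal returns the Class object describing
    // java.lang.Class itself, not the application class.
    static Class<?> classOfLiteral() {
        return ResourceLookupDemo.class.getClass();
    }

    // Resolves the resource through the application class's own loader.
    // Returns null when the resource is not on the classpath.
    static InputStream openFromOwnClass(String absolutePath) {
        return ResourceLookupDemo.class.getResourceAsStream(absolutePath);
    }

    public static void main(String[] args) throws Exception {
        // Both checks concern java.lang.Class, which is loaded by the bootstrap loader.
        System.out.println(classOfLiteral() == Class.class);
        System.out.println(Class.class.getClassLoader() == null);

        // The path is the one from the report; it is normally absent outside that project.
        try (InputStream in = openFromOwnClass("/poi/template.doc")) {
            System.out.println(in == null ? "template not on classpath" : "template found");
        }
    }
}
```

Checking the returned stream for null before handing it to a parser or to POIFSFileSystem makes a missing template fail with a clear message instead of a later, less obvious error.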

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
