tika-dev mailing list archives

From "Joseph Vychtrle (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-690) WordExtractor doesn't extract text from HWPFDocument
Date Sun, 14 Aug 2011 17:33:27 GMT

    [ https://issues.apache.org/jira/browse/TIKA-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084869#comment-13084869 ]

Joseph Vychtrle commented on TIKA-690:
--------------------------------------

I was using a Tika snapshot, and therefore POI 3.8-beta3 ... Anyway, first of all, Tika's WordExtractor
doesn't extract anything from a .doc unless the text is in a paragraph. I finally made it work like
this:
{code}
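private void createDOCDocument(String from, File file) throws Exception {

	// Load the template via DOCGenerator.class (not DOCGenerator.class.getClass())
	POIFSFileSystem fs = new POIFSFileSystem(DOCGenerator.class.getResourceAsStream("/poi/template.doc"));
	HWPFDocument doc = new HWPFDocument(fs);

	// Insert the text into the first paragraph rather than directly into the range
	Range range = doc.getRange();
	Paragraph par1 = range.getParagraph(0);

	CharacterRun run1 = par1.insertBefore(from, new CharacterProperties());
	run1.setFontSize(11);

	// Close the output stream once the document has been written
	FileOutputStream out = new FileOutputStream(file);
	try {
		doc.write(out);
	} finally {
		out.close();
	}
}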
{code}


So even if you have a .doc that looks exactly the same, but the text was inserted directly via range.insertBefore(),
WordExtractor doesn't extract it.

> WordExtractor doesn't extract text from HWPFDocument
> ----------------------------------------------------
>
>                 Key: TIKA-690
>                 URL: https://issues.apache.org/jira/browse/TIKA-690
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9, 1.0
>            Reporter: Joseph Vychtrle
>              Labels: parsing
>
> If I use Apache POI's HWPF component to create an MS Word .doc and pass it to tika.parseToString(is),
> it returns just a newline "\n". I tested that with tons of different input text. Adding
> paragraphs doesn't help.
> {code}
> private void createDOCDocument(String from, File file) throws Exception {
>     POIFSFileSystem fs = new POIFSFileSystem(DOCGenerator.class.getResourceAsStream("/poi/template.doc"));
>     HWPFDocument doc = new HWPFDocument(fs);
>     Range range = doc.getRange();
>     CharacterRun run1 = range.insertBefore(from);
>     run1.setFontSize(11);
>     DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
>     CustomProperties cp = dsi.getCustomProperties();
>     if (cp == null)
>         cp = new CustomProperties();
>     cp.put("myProperty", "foo bar baz");
>     dsi.setCustomProperties(cp);
>     doc.write(new FileOutputStream(file));
> }
> {code}
> {code}
> protected String extractText(InputStream is) throws SystemException {
> 	Tika tika = new Tika();
> 	tika.setMaxStringLength(new Long(maxCharCount).intValue());
> 	String text;
> 	try {
> 		text = tika.parseToString(is);
> 	} catch (IOException ioe) {
> 		throw new SystemException(ioe.getMessage(), ioe);
> 	} catch (TikaException te) {
> 		throw new SystemException(te.getMessage(), te);
> 	}
> 	return text;
> }
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
