lucene-java-user mailing list archives

From maxSchlein <>
Subject Text extraction from ms word doc
Date Mon, 11 Jan 2010 20:04:34 GMT

I am looking for a way to extract text from a Word document.

Currently I am using POI; however, when there is a table in the doc, POI
brings back a non-printable character at the end of each column.  The
whitespace analyzer does not filter out this character, so the last word or
phrase within a table column is not found during searching.  That is, if the
word dog is the only word in a column, a search for the word dog returns
nothing, because the term that was indexed was "dog" followed by that
character.

I can create a filter to fix this, using Apache Commons Lang's
StringUtils.isAsciiPrintable, but I would rather not.
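
For reference, the kind of cleanup I have in mind could look roughly like the
sketch below. It is only an illustration: the class and method names are
hypothetical, it uses plain JDK calls instead of StringUtils, and it assumes
the stray character is an ISO control character, which I have not confirmed
for every document. It swaps each such character for a space so that a
whitespace-based analyzer still splits the terms.

```java
/** Hypothetical helper: cleans extracted text before it is handed to the indexer. */
final class ControlCharCleaner {

    private ControlCharCleaner() {
        // static utility, no instances
    }

    /**
     * Replaces every ISO control character that is not ordinary whitespace
     * with a single space. Tabs, newlines and non-ASCII letters are kept.
     */
    static String clean(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && !Character.isWhitespace(c)) {
                out.append(' ');   // assumed cell-end marker or similar debris
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The cleaned string would be passed to the document field in place of the raw
POI output. Unlike a strict ASCII-printable check, this keeps accented and
other non-ASCII characters intact.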

Any and all help is welcome and thanked.