lucene-java-user mailing list archives

From Michael McCandless
Subject Re: Text extraction from ms word doc
Date Wed, 13 Jan 2010 11:07:20 GMT
We could also fix WhitespaceAnalyzer to filter that character out?
(Or you could make your own analyzer to do so...).
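
As a rough illustration of the "make your own analyzer" idea, here is a minimal plain-Java sketch of the splitting rule such an analyzer might apply. It is not Lucene API: the class ExtendedWhitespace and its methods are hypothetical names, and the thread does not say which character POI returns, so the U+0007 used in the usage note is only an assumed stand-in. The rule treats ordinary whitespace, Unicode space separators (such as U+00A0), and ISO control characters as token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or POI.
final class ExtendedWhitespace {
    private ExtendedWhitespace() {}

    /** Returns true for code points that should end a token. */
    static boolean isSeparator(int cp) {
        return Character.isWhitespace(cp)    // space, tab, newline, ...
            || Character.isSpaceChar(cp)     // Unicode space separators, e.g. U+00A0
            || Character.isISOControl(cp);   // control characters, e.g. U+0007
    }

    /** Splits text into tokens on every separator code point. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator(cp)) {
                // Close the current token, if any, and skip the separator.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this rule, ExtendedWhitespace.tokenize("dog\u0007") yields the single token "dog" with no trailing character attached. The same isSeparator test could presumably be moved into a custom tokenizer or filter inside an analyzer, but that wiring depends on the Lucene version in use.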

You could also try asking on the tika-user list whether Tika has a
solution for mapping "extended" whitespace characters...


On Mon, Jan 11, 2010 at 3:04 PM, maxSchlein wrote:
> I was looking for an option for Text extraction from a word doc.
> Currently I am using POI; however, when there is a table in the doc, for
> each column POI brings back an unprintable character.  The whitespace analyzer is not filtering
> out this character.  So whatever word or phrase that is the last word or
> phrase within a table column is not found during searching.  That is, if the
> word dog is the only word in a column, a search for the word dog would
> return nothing because the word that was indexed was "dog ".
> I can create a filter to fix this, using Apache's
> StringUtils.isAsciiPrintable, but I would rather not.
> Any and all help is welcome and thanked.