lucene-java-user mailing list archives

From David Spencer <>
Subject my experiences - Re: Parsing Word Docs
Date Wed, 05 Mar 2003 23:24:55 GMT
FYI, I tried the combo on a collection of 350 Word docs that people here have
developed over the years, and it failed on 33% of them, throwing exceptions
about the formats being invalid.

I tried "antiword", a free native *.exe, and it worked great (well, it
seemed to process all the files fine).

I've had similar experiences with PDF: I tried the 3 or so freeware/Java PDF
text extractors, and they were not as good as the native exe, pdftotext, from
foolabs.

Not satisfying to a Java developer, but these work better than anything else
I can find.

You get source, and I use them on Windows & Linux, no prob.
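
For anyone wanting to call these tools from Java before indexing, here is a
minimal sketch. It is not necessarily how the setup described above was wired
up. Assumptions: "antiword" and "pdftotext" are on the PATH, antiword writes
the extracted text to stdout when given a file name, pdftotext does the same
when the output file is given as "-", and the output is decoded as UTF-8 (the
real encoding depends on how each tool is configured). The class and method
names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs a native text extractor and returns its stdout.
public class ExternalTextExtractor {

    /** Runs the command and returns everything it wrote to stdout. */
    public static String run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Pass the tool's stderr through to ours so an unread pipe cannot
        // fill up and stall the child process.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        p.getOutputStream().close(); // the tool gets no stdin from us

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = p.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }

        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException(command.get(0) + " exited with code " + exit);
        }
        // Assumption: the tool emits UTF-8; adjust to match your setup.
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Assumes "antiword <file>" prints the document text to stdout. */
    public static String extractWord(String path)
            throws IOException, InterruptedException {
        return run(Arrays.asList("antiword", path));
    }

    /** Assumes "pdftotext <file> -" prints the document text to stdout. */
    public static String extractPdf(String path)
            throws IOException, InterruptedException {
        return run(Arrays.asList("pdftotext", path, "-"));
    }
}
```

The returned string can then go into whatever field your Lucene Document uses
for body text. A non-zero exit code surfaces as an IOException, so files the
tool cannot read can be logged and skipped rather than aborting the whole
indexing run.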

Eric Anderson wrote:

>I'm interested in using the textmining/textextraction utilities using Apache 
>POI, that Ryan was discussing. However, I'm having some difficulty determining 
>what the insertion point would be to replace the default parser with the word 
>Any assistance would be appreciated.
>LanRx Network Solutions, Inc.
>Providing Enterprise Level Solutions...On A Small Business Budget
