lucene-java-user mailing list archives

From Ben Litchfield
Subject Re: Exotic format indexing?
Date Thu, 30 Oct 2003 19:48:41 GMT
Unfortunately, it is not quite so easy.  I am not sure about Word
documents, but PDFs usually have their contents compressed, so "fishing"
around in the raw bytes for text would be pointless.  Your best bet is to use a
package like the one from that handles various formats for
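
To illustrate the compression point, here is a minimal sketch, not taken from any particular library. It assumes the stream is deflate-compressed (PDF content streams commonly use the FlateDecode filter) and uses java.util.zip.Deflater as a stand-in for what a PDF writer does. The class name RawTextFishing and its helper methods are hypothetical names made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical demo class: "fishes" for text in raw bytes, before and after compression.
class RawTextFishing {

    // Collect runs of at least minLen printable ASCII bytes, similar to Unix `strings`.
    public static List<String> printableRuns(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b <= 0x7e) {
                current.append((char) b);
            } else {
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return runs;
    }

    // Compress with deflate, standing in for a compressed PDF content stream.
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("Lucene indexes plain text, not binary blobs. ");
        }
        byte[] plain = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(plain);

        // The scan recovers the sentence from the uncompressed bytes only.
        System.out.println("runs in plain bytes:    " + printableRuns(plain, 6));
        System.out.println("runs in deflated bytes: " + printableRuns(packed, 6));
    }
}
```

The scan on the deflated bytes may still turn up a few short printable runs by chance, but they are noise and not the original words, so indexing them would not help a search.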


On Thu, 30 Oct 2003, petite_abeille wrote:

> Hello,
> Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a
> popular question on this list...
> The traditional approach seems to be to try to find some kind of format
> specific reader to properly extract the textual part of such documents
> for indexing. The drawback of such an approach is that it is complicated
> and cumbersome: many different formats, not that many Java libraries to
> understand them all.
> An alternative to such a mess could perhaps be to convert that
> multitude of formats into something more or less standard and then
> extract the text from that. But again, this doesn't seem to be such a
> straightforward proposition. For example, one could imagine "printing"
> every document to PDF and then converting the resulting PDF to text. Not a
> piece of cake in Java.
> Finally, a while back, somebody on this list mentioned quite a
> different approach: simply read the raw binary document and go fishing
> for what looks like text. I would like to try that :)
> Does anyone remember this proposal? Has anyone tried such an approach?
> Thanks for any pointers.
> Cheers,
> PA.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
