lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From petite_abeille <>
Subject Exotic format indexing?
Date Thu, 30 Oct 2003 18:20:43 GMT

Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a 
popular question on this list...

The traditional approach seems to be to try to find some kind of format 
specific reader to properly extract the textual part of such documents 
for indexing. The drawback of such an approach is that its complicated 
and cumborsome: many different formats, not that many Java libraries to 
understand them all.

An alternative to such a mess could be perhaps to convert those 
multitude of formats into something more or less standard and then 
extract the text from that. But again, this doesn't seem to be such a 
straightforward proposition. For example, one could image "printing" 
every document to PDF and then convert the resulting PDF to text. Not a 
piece of cake in Java.

Finally, a while back, somebody on this list mentioned quiet a 
different approach: simply read the raw binary document and go fishing 
for what looks like text. I would like to try that :)

Does anyone remember this proposal? Has anyone tried such an approach?

Thanks for any pointers.



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message