lucene-java-user mailing list archives

From Stefan Groschupf
Subject 182 file formats for lucene!!! was: Re: Exotic format indexing?
Date Thu, 30 Oct 2003 20:02:42 GMT
Hi there,

Just to let you know, I have implemented a plugin for the Nutch project
that can parse 182 file formats, including MS Office.
It simply uses OpenOffice through its available Java API.

It is really straightforward to use.

You can find some information and a link to the open source code here:

Feel free to recycle the code and give me any feedback.
I hope it helps to free some information from some strange commercial
formats, since information should be free. ;)


Ben Litchfield wrote:

>Unfortunately, it is not quite so easy.  I am not sure about Word
>documents, but PDFs usually have their contents compressed, so a raw
>"fishing" around for text would be pointless.  Your best bet is to use a
>package like the one from that handles various formats for
>On Thu, 30 Oct 2003, petite_abeille wrote:
>>Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a
>>popular question on this list...
>>The traditional approach seems to be to try to find some kind of format
>>specific reader to properly extract the textual part of such documents
>>for indexing. The drawback of such an approach is that it's complicated
>>and cumbersome: many different formats, not that many Java libraries to
>>understand them all.
>>An alternative to such a mess could perhaps be to convert that
>>multitude of formats into something more or less standard and then
>>extract the text from that. But again, this doesn't seem to be such a
>>straightforward proposition. For example, one could imagine "printing"
>>every document to PDF and then converting the resulting PDF to text. Not a
>>piece of cake in Java.
>>Finally, a while back, somebody on this list mentioned quite a
>>different approach: simply read the raw binary document and go fishing
>>for what looks like text. I would like to try that :)
>>Does anyone remember this proposal? Has anyone tried such an approach?
>>Thanks for any pointers.
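
For reference, the "go fishing for what looks like text" idea from the
quoted message can be sketched roughly as below. This is only a minimal
illustration, not code from Nutch or Lucene: the class name RawTextFisher,
the method extractRuns and the minimum run length of 4 are all assumptions
made for the example. It collects runs of printable ASCII bytes, so it
finds nothing useful in compressed content (the PDF problem Ben describes)
and misses text stored in other encodings.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or Nutch.
final class RawTextFisher {

    // Returns every run of at least minLength printable ASCII bytes in data.
    static List<String> extractRuns(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if (c >= 0x20 && c <= 0x7E) {
                // Printable ASCII: extend the current run.
                current.append((char) c);
            } else {
                // Anything else ends the run; keep it only if long enough.
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java RawTextFisher <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String run : extractRuns(data, 4)) {
            System.out.println(run);
        }
    }
}
```

The extracted runs could then be concatenated and fed to an indexer as one
text field, though a format-aware parser will generally give cleaner
results.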
