lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <>
Subject Re: indexing PDF files
Date Wed, 01 May 2002 15:41:56 GMT
> > Hm, this should be a FAQ.
> Maybe it should... ;-)

It is now.

> > Check Lucene contributions page, there are some starting points
> there,
> Well, this seems to be a very popular request... In fact I need 
> something like that also. Unfortunately, there seems to be no 
> authoritative answer as far as converting pdf files to text in a pure
> Java environment... Maybe I'm missing something here as usual?
> Also, on a related note, what would be a good approach to convert any
> random document into pdf? I was thinking to have a two steps process
> for 
> document indexing in Lucene:
> - First, convert everything to pdf (with Acrobat or something)
> - Second, convert pdf to text and index it.
> Any practical suggestions about how to do that in a pure Java 
> environment very welcome.

Wouldn't you want to convert to XML instead and use XSLT to transform
the XML representation to any desired format by just applying a style
Sounds like less work with bigger document type coverage.


Do You Yahoo!?
Yahoo! Health - your guide to health and wellness

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message