lucene-java-user mailing list archives

From "Armbrust, Daniel C." <>
Subject RE: MS Word Search ??
Date Thu, 30 May 2002 15:50:28 GMT
This might be worth looking into for those who need to parse Word, Excel,
PowerPoint, or other Microsoft file types.

OpenOffice knows how to parse all of the Microsoft formats (at least all
that I've tried so far), and you can then do a "Save As" and write out the
OpenOffice format, which is a couple of XML files zipped together.  So this
makes me think of two possible ways that you could get at the content of the
MS files in a text form you can index (neither of which I have tried or even
looked into to see if they are possible):

#1 - get the code for OpenOffice (it is open source) and use it for
parsing the MS documents into XML, which could then be indexed

#2 - if OpenOffice is programmatically drivable (I don't know whether it
is), fire up a copy of OpenOffice and use it to convert the files as

Just some suggestions.  Does anyone know much more about OpenOffice?  I
would be interested in knowing if either of these would be feasible.
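
To make the "couple of XML files zipped together" point concrete, here is a
minimal Java sketch of pulling indexable text out of a file that has already
been saved in the OpenOffice format. It is untested against real documents
and rests on assumptions: that the body text lives in a zip entry named
content.xml, and that simply concatenating the character data of that XML is
good enough for indexing. The class and method names (OpenOfficeText,
extractText) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of Lucene or OpenOffice.
public class OpenOfficeText {

    /** Returns the text of the "content.xml" entry (assumed name), or "" if there is none. */
    public static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    return textOf(readAll(zip));
                }
            }
        }
        return "";
    }

    // Reads the current zip entry fully into memory.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Collects all character data; every closing tag becomes a space so that
    // neighbouring paragraphs do not run together (this can split a word that
    // is broken up by inline markup).
    private static String textOf(byte[] xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' ');
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Never fetch an external DTD; hand the parser an empty one.
                return new InputSource(new StringReader(""));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(in));
        }
    }
}
```

The returned string could then be handed to whatever indexing code you
already have. This only covers the step after the conversion; getting from
the MS file to the OpenOffice file is still the open question above.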


-----Original Message-----
From: Ewout Prangsma []
Sent: Wednesday, May 29, 2002 1:00 PM
To: Lucene Users List
Subject: Re: MS Word Search ??

On Wednesday 29 May 2002 11:56, Karl Øie wrote:
> b: convert the documents to something that is accessible through Java, like
> XML, etc...

We're using wvWare to convert Word to HTML (or text) and index that, and
xpdf to convert PDF to text and index that. Any links on indexing using
POI converters (or other Java converters) are very welcome!
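
For anyone wanting to try the same route, the "run an external converter,
then index its output" step might look roughly like the sketch below. This
is not our actual code. It assumes xpdf's pdftotext command is on the PATH,
that passing "-" as the output file makes it write to standard output, and
that "-enc UTF-8" selects UTF-8 output; the class and method names
(ExternalConverter, run, pdfToText) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for shelling out to a command-line converter.
public class ExternalConverter {

    /** Runs the command and returns its standard output decoded as UTF-8. */
    public static String run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Let the converter's error messages go to our own stderr so they
        // do not end up in the text that gets indexed.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        p.getOutputStream().close(); // the converter gets no stdin

        // Drain stdout completely before waiting, so a large document
        // cannot block the child on a full pipe.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = p.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("converter exited with " + exit + ": " + command);
        }
        return out.toString("UTF-8");
    }

    /** Assumed invocation: pdftotext -enc UTF-8 <file> - */
    public static String pdfToText(String pdfPath)
            throws IOException, InterruptedException {
        return run(Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-"));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pdfToText(args[0]));
    }
}
```

A Word converter can be plugged into run() the same way, provided it can
write its result to standard output; if it only writes to a file, read that
file back instead.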


> The best way is to convert, as the Java APIs for MS Office documents are
> still under development.
> Regards, Karl Øie
> On Wednesday 29 May 2002 11:48, Rama Krishna wrote:
> > Hi,
> >
> > I am trying to build a search engine that searches MS Word, Excel,
> > PowerPoint and Adobe PDF files. I am not sure whether I can use Lucene
> > for this or not. Please help me out in this regard.
> >
> >
> > Regards,
> > Ramakrishna
> >
> >

Ewout Prangsma, Director
Daisy Software
Telephone/fax: +31-77-3270305/3270306
KvK Venlo no. 12046144
