lucene-java-user mailing list archives

From Andrew Libby <>
Subject Re: indexing and searching different file formats
Date Wed, 13 Feb 2002 16:20:06 GMT

    Currently Lucene does not provide the ability to convert documents
to text for indexing.  There is talk of adding this to the project's
goals, along with crawlers that traverse web, local disk, FTP, and
RDBMS data sources.

The problem with indexing irrespective of file type is that each document
format contains embedded information that must be stripped out (or ignored)
so that the text can be retrieved for indexing.  An extreme example is
PDF, which has a considerably complicated document format.
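
To make the stripping step concrete, here is a rough sketch in plain Java
of picking a text extractor by file extension before anything is handed to
the indexer.  None of this is Lucene API: the class name TextExtractors,
the extension table, and the regex-based tag stripping are all assumptions
made for illustration, and a real HTML or XML parser (let alone a PDF
parser) would be far more robust.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Lucene: turns raw file content into plain text.
public class TextExtractors {
    // Maps a lowercase file extension to a function that produces indexable text.
    private static final Map<String, Function<String, String>> EXTRACTORS = new HashMap<>();

    static {
        EXTRACTORS.put("txt", String::trim);                  // plain text needs no stripping
        EXTRACTORS.put("html", TextExtractors::stripMarkup);
        EXTRACTORS.put("htm", TextExtractors::stripMarkup);
        EXTRACTORS.put("xml", TextExtractors::stripMarkup);
    }

    // Naive markup removal: replaces anything that looks like a tag with a space,
    // then collapses whitespace. It does not decode entities or skip script/style bodies.
    public static String stripMarkup(String raw) {
        return raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Chooses an extractor from the file extension; unknown formats are rejected
    // rather than indexed as raw bytes.
    public static String extract(String fileName, String raw) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<String, String> extractor = EXTRACTORS.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No text extractor registered for: " + fileName);
        }
        return extractor.apply(raw);
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extract("page.html", html)); // prints "Title Some bold text."
    }
}
```

The string returned by extract is what you would then pass to the indexer
as the document's contents.  Supporting another format means registering
one more extractor, which is where a format like PDF gets expensive.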

On the contributions page there are some pointers that may provide information
about processing the types of documents you're interested in.

If you've not taken the time to do so, look at the FAQs, they are very

Good luck!


On Wed, Feb 13, 2002 at 09:24:33PM +0530, Pradeep Kumar K wrote:
> Hi Lucene friends!
>    How the files of different format can be indexed and searched? ( As I 
> know lucene is having HTML indexer and searcher, which comes along with 
> it and also XML indexer, but is there any way to index files  
> irrespective of the file type)
> Any suggestions will be greatly appreciated..
> Thanks in advance.
> Pradeep
> --------------------------------------------------------------
> Robosoft Technologies, Mangalore, India
> --
> To unsubscribe, e-mail:   <>
> For additional commands, e-mail: <>

Andrew Libby
CommNav, Inc
