lucene-java-user mailing list archives

From Karl Øie <>
Subject Re: Indexing distant web sites
Date Mon, 04 Nov 2002 13:39:01 GMT
oh, sorry.. i was perhaps not making myself clear here...

you will have to use the crawler to retrieve the content and store it
locally for indexing. so set up your crawler to fetch a site and write
every html page to disk, run Lucene on the locally stored pages, and
delete them afterwards. you will also need a way to get each page's
original url from the crawler, and store that in Lucene too, as a
keyword field.
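
a rough sketch of that disk-based flow, in plain Java. everything named here is an assumption made for illustration: `PageIndexer` is a hypothetical stand-in for whatever code builds the Lucene document (url as keyword field, html as searchable text), and the file-to-url map is assumed to be something the crawler can hand over or that you record while it runs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

class DiskIndexSketch {
    /** Hypothetical stand-in for the Lucene side: called once per stored page. */
    interface PageIndexer {
        void add(String url, String html) throws IOException;
    }

    /**
     * Indexes every page the crawler stored on disk, then deletes it.
     * urlByFile maps each stored file to the url it was fetched from
     * (assumed to be recorded during the crawl).
     * Returns the number of pages indexed.
     */
    static int indexAndClean(Map<Path, String> urlByFile, PageIndexer indexer)
            throws IOException {
        int count = 0;
        for (Map.Entry<Path, String> entry : urlByFile.entrySet()) {
            Path file = entry.getKey();
            // Assumes the pages were saved as UTF-8; adjust to the crawler's output.
            String html = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            indexer.add(entry.getValue(), html); // the url would become the keyword field
            Files.delete(file);                  // the local copy is not needed once indexed
            count++;
        }
        return count;
    }
}
```

the point of passing the url alongside the content is that a search hit can then be mapped back to the live page rather than to a deleted local file.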

a much more efficient way is to have the crawler fetch one page, keep
it in memory, run Lucene on it, discard the buffer, and then move on
to the next page.
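
the in-memory variant could look roughly like this. again only a sketch under assumptions: `fetch` stands in for the crawler's page download and `index` for the Lucene indexing step; neither is a real crawler or Lucene API, and a `null` result from `fetch` is assumed to mean the page could not be retrieved.

```java
import java.util.function.BiConsumer;
import java.util.function.Function;

class StreamingIndexSketch {
    /**
     * Fetches, indexes and discards one page at a time, so at most one
     * page is held in memory and nothing is written to disk.
     * Returns the number of pages that were indexed.
     */
    static int crawlAndIndex(Iterable<String> urls,
                             Function<String, String> fetch,
                             BiConsumer<String, String> index) {
        int indexed = 0;
        for (String url : urls) {
            String html = fetch.apply(url); // the page lives only in this local buffer
            if (html == null) {
                continue;                   // assumed convention: null means fetch failed
            }
            index.accept(url, html);        // url + content go to the indexer together
            indexed++;
            // html goes out of scope here, so the buffer can be reclaimed
        }
        return indexed;
    }
}
```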

if you want to take a look at a real lucene + crawler implementation
you can check out the Cocoon project at :

Lucene integration:


Crawler implementation:


This implementation indexes XML, but the principle is the same...

kind regards, karl øie

On Monday, Nov 4, 2002, at 14:29 Europe/Oslo, Friaa Nafaa wrote:

> Thank you. I installed this crawler and ran it, but I would like to
> index the web site, not list the links visited by the crawler. Is
> there a way to search a web page with Lucene, using this crawler to
> visit the pages? Thanks.
>
> --- On Mon 11/04, Karl Marx < > wrote:
> From: Karl Marx [mailto:]
> To: lucene-user@jakarta.apache.org
> Date: Mon, 4 Nov 2002 12:31:50 +0100
> Subject: Re: Indexing distant web sites
>
> As stated in the official FAQ, Lucene doesn't implement a web crawler.
> You can however use a self-made crawler or customize a crawler
> framework like websphinx ( to retrieve html documents from a site and
> then feed them to Lucene.
>
> kind regards, karl øie
>
> On Monday, Nov 4, 2002, at 11:49 Europe/Oslo, Friaa Nafaa wrote:
> > Hello, is there any way to index web sites with Lucene, assuming we
> > know only the url of the site?
> > Locally, we pass Lucene the full directory tree of our site
> > (containing all the documents) and start the indexing operation. But
> > when I would like to index a distant site on the web, what do I do?
> > For example, I installed Lucene on my computer and I would like to
> > index the site: ...
> > Thanks

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>
