cocoon-users mailing list archives

From "Conal Tuohy"
Subject Lucene indexing / crawling problem
Date Mon, 09 Jun 2003 05:43:48 GMT
I'm creating a Lucene index using an XSP based on the sample, but I have a strange problem.

Some of the pages are crawled but others are not, and I can't see why.

I have DEBUG logging enabled for the components, so I can see the crawler crawling the
site. I can see it read the links for each page, and I can see that it doesn't exclude any
of the links. Yet it doesn't actually follow all of those links: the crawl simply comes to an
end at some point, with some of the links uncrawled.

It seems to me that for every log entry from SimpleCocoonCrawlerImpl that says "Add URL: http://blah..."
I should also have an entry from SimpleLuceneXMLIndexerImpl that says "Indexing http://blah..."
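
To check that expectation against the log, something like the following could list the URLs that were added but never indexed. This is only a rough sketch: the "Add URL:" and "Indexing" message prefixes are taken from the log entries quoted in this message, but the function names are made up for illustration, and stripping the cocoon-view parameter before comparing is an assumption about how the crawler and indexer URLs differ.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Message prefixes as they appear in the DEBUG log entries quoted in this post.
ADDED = re.compile(r"Add URL:\s*(\S+)")
INDEXED = re.compile(r"Indexing\s+(\S+)")


def normalize(url):
    """Drop the cocoon-view parameter so both kinds of entry compare equal.

    Assumption: the two components may log the same page with different
    cocoon-view values (or none), so that parameter is ignored here.
    """
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "cocoon-view"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ""))


def unindexed_urls(log_lines):
    """Return URLs that have an 'Add URL' entry but no 'Indexing' entry."""
    added = []
    indexed = set()
    for line in log_lines:
        match = ADDED.search(line)
        if match:
            added.append(normalize(match.group(1)))
        match = INDEXED.search(line)
        if match:
            indexed.add(normalize(match.group(1)))
    # Keep first-seen order and report each missing URL only once.
    missing = []
    for url in added:
        if url not in indexed and url not in missing:
            missing.append(url)
    return missing
```

Passing it the lines of the log file (for example an open file object) should give the list of pages the crawler queued but the indexer never reported, which would at least show where the crawl stops.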

The home page is crawled, and all of the pages linked from it, and SOME of the pages linked
from those pages, and SOME of the pages linked from THOSE pages. I can't see why some pages are
crawled and others not. Perhaps the crawler simply stops at some point before it has finished
its list of URLs. But why would it stop crawling without logging any error? BTW, the last entry
in the log is always the SimpleLuceneXMLIndexerImpl reporting that it has indexed a page, e.g.:

DEBUG   (2003-06-09) 17:32.05:388   [] (/search/reindex.xml) HttpProcessor[80][4]/SimpleLuceneXMLIndexerImpl:
Indexing http://localhost:80/etexts/JCB-016/full.html?cocoon-view=content (text/xml)

Does anyone have any ideas where I could start looking?

I'm using version RELEASE_2_1_M_2.


