lucene-java-user mailing list archives

From Leo Galambos <>
Subject Re: High Capacity (Distributed) Crawler
Date Mon, 09 Jun 2003 21:56:20 GMT
Hi Otis.

The first beta is done (without NIO). However, it needs further 
testing. Unfortunately, I could not find enough servers that I am allowed to hit.

I wanted to commit the robot as part of egothor (which will use it in 
PULL mode), but we have nice weather here, so I lost any motivation to 
play with the PC ;-).

Which interface do you need for Lucene? Will you use PUSH mode (the robot 
modifies Lucene's index) or PULL mode (the engine gets deltas from 
the robot)? Tell me what you need and I will do my best.
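
To make the PUSH/PULL distinction concrete, here is a rough Java sketch. All names in it (Delta, IndexSink, DeltaSource, Robot) are hypothetical and are not part of egothor, LARM, or Lucene; it only illustrates the two directions of control, assuming deltas are kept in a simple in-memory log.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; not egothor's or Lucene's API.

/** One change seen by the robot: a page was added/updated, or it disappeared. */
final class Delta {
    enum Kind { UPSERT, DELETE }

    final Kind kind;
    final String url;
    final String content; // null for DELETE

    Delta(Kind kind, String url, String content) {
        this.kind = kind;
        this.url = url;
        this.content = content;
    }
}

/** PUSH mode: the robot calls into the index whenever it sees a change. */
interface IndexSink {
    void apply(Delta delta);
}

/** PULL mode: the engine asks the robot for all deltas after a sequence number. */
interface DeltaSource {
    List<Delta> deltasSince(int sequence);

    int latestSequence();
}

/** A robot that logs every delta (for PULL) and optionally forwards it (for PUSH). */
final class Robot implements DeltaSource {
    private final List<Delta> log = new ArrayList<>();
    private final IndexSink sink; // may be null when only PULL is used

    Robot(IndexSink sink) {
        this.sink = sink;
    }

    void record(Delta delta) {
        log.add(delta);
        if (sink != null) {
            sink.apply(delta); // PUSH: the robot drives the index update
        }
    }

    @Override
    public List<Delta> deltasSince(int sequence) {
        // Clamp so that an out-of-range sequence yields an empty or full list.
        int from = Math.max(0, Math.min(sequence, log.size()));
        return new ArrayList<>(log.subList(from, log.size()));
    }

    @Override
    public int latestSequence() {
        return log.size();
    }
}
```

In PUSH mode the engine would hand the robot an IndexSink that wraps its index writer; in PULL mode it would remember the last sequence number it saw and call deltasSince with it on its own schedule.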


Otis Gospodnetic wrote:

>Have you started this project?  Where is it hosted?
>It would be nice to see a few alternative implementations of a robust
>and scalable java web crawler with the ability to index whatever it
>--- Leo Galambos <> wrote:
>>I would like to write $SUBJ (HCDC), because LARM does not offer many 
>>options which are required by web/http crawling IMHO. Here is my
>>1. I would like to manage the decision what will be gathered first - 
>>this would be based on pageRank, number of errors, connection speed
>>2. pure JAVA solution without any DBMS/JDBC
>>3. better configuration in case of an error
>>4. NIO style as it is suggested by LARM specification
>>5. egothor's filters for automatic processing of various data formats
>>6. management of "Expires" HTTP-meta headers, heuristic rules which
>>describe how fast a page can expire (.php often expires faster than
>>7. reindexing without any data exports from a full-text index
>>8. open protocol between the crawler and a full-text engine
>>If anyone wants to join (or just extend the wish list), let me know,
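
Item 1 of the wish list (deciding what is gathered first from pageRank, number of errors, and connection speed) could be sketched as a priority queue over candidate URLs. This is only an illustration: the class names, the scoring formula, and its weights are assumptions made for the example, not anything specified in the thread.

```java
import java.util.PriorityQueue;

// Hypothetical sketch; the names and the scoring weights are assumptions.

/** A URL waiting to be fetched, with the three signals named in the wish list. */
final class CrawlCandidate {
    final String url;
    final double pageRank;    // higher means more important
    final int errorCount;     // number of failed fetches so far
    final double kbPerSecond; // observed connection speed to the host

    CrawlCandidate(String url, double pageRank, int errorCount, double kbPerSecond) {
        this.url = url;
        this.pageRank = pageRank;
        this.errorCount = errorCount;
        this.kbPerSecond = kbPerSecond;
    }

    /** Illustrative score: reward rank and speed, penalize past errors. */
    double score() {
        double speedBonus = Math.min(kbPerSecond, 1000.0) / 1000.0; // capped at 1.0
        return pageRank * 10.0 - errorCount * 2.0 + speedBonus;
    }
}

/** Hands out the highest-scoring candidate first. */
final class Frontier {
    private final PriorityQueue<CrawlCandidate> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.score(), a.score()));

    void add(CrawlCandidate candidate) {
        queue.add(candidate);
    }

    /** Returns the best candidate, or null when the frontier is empty. */
    CrawlCandidate next() {
        return queue.poll();
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

A real crawler would also need per-host politeness limits and a way to re-score candidates as error counts and speeds change; both are left out here to keep the sketch short.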
