nutch-dev mailing list archives

From Fabrice Estiévenart <...@info.fundp.ac.be>
Subject getting a list of matching URLs from a start URL
Date Tue, 08 Mar 2005 12:47:04 GMT
Hello,

From a list of start URLs (each associated with a regular expression),
I'd like to get, for each start URL, all URLs that come from the same
domain and match the expression. I don't want to analyse or index
the URLs, just write them to a flat file.

Example:
start URL: http://www.mydomain.com
regular expression: /files/*.html

gives:
- http://www.mydomain.com/files/index.html
- http://www.mydomain.com/files/a.html
- http://www.mydomain.com/files/a01.html
- http://www.mydomain.com/files/b.html
- ...
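
To make the expected behaviour concrete, here is a minimal sketch of the filtering step in plain Java. It does not use any Nutch API and does not crawl: it assumes the candidate URLs have already been collected somewhere. The class name UrlFilterSketch, the methods matching and writeMatches, and the output file name are hypothetical, and the pattern /files/*.html is interpreted here as the Java regex /files/[^/]*\.html applied to the URL path, which is only one possible reading.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: filters already-collected URLs.
final class UrlFilterSketch {

    // Keeps the candidates that are on the same host as startUrl and
    // whose path fully matches pathRegex.
    static List<String> matching(String startUrl, String pathRegex,
                                 List<String> candidates) {
        String host = URI.create(startUrl).getHost();
        Pattern pattern = Pattern.compile(pathRegex);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            URI uri;
            try {
                uri = URI.create(candidate);
            } catch (IllegalArgumentException e) {
                continue; // skip malformed URLs
            }
            String path = uri.getPath(); // null for opaque URIs such as mailto:
            if (host != null && host.equalsIgnoreCase(uri.getHost())
                    && path != null && pattern.matcher(path).matches()) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Writes one URL per line to a flat text file.
    static void writeMatches(String fileName, List<String> urls)
            throws IOException {
        Files.write(Paths.get(fileName), urls);
    }

    public static void main(String[] args) throws IOException {
        // Assumed input: URLs discovered from the start URL by some crawl.
        List<String> found = Arrays.asList(
                "http://www.mydomain.com/files/index.html",
                "http://www.mydomain.com/files/a.html",
                "http://www.mydomain.com/other/c.html",
                "http://www.elsewhere.org/files/a.html");
        List<String> kept = matching("http://www.mydomain.com",
                "/files/[^/]*\\.html", found);
        writeMatches("matching-urls.txt", kept);
    }
}
```

With the sample input in main, only the two URLs under http://www.mydomain.com/files/ are kept. Comparing hosts exactly is a simplification of "same domain": subdomains of mydomain.com would be rejected by this sketch.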

How can I do that simply with Nutch, without "reinventing the wheel"?
Should I extend an existing class, or develop a plugin? Could you give me
some tips, please?

Thanks a lot for this useful forum!

Fabrice
