nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: crawling protected pages
Date Mon, 12 Sep 2005 19:04:59 GMT
Edward Quick wrote:
> Hi,
> 
> I posted to the user list but didn't get a reply. I want to crawl a 
> protected site, but there doesn't seem to be an option for that in Nutch 
> at the moment.
> 
> However, it doesn't sound like something that would be too hard to add, 
> assuming the java http client library can handle that. As I'm not 
> familiar with the code, could someone point me at the file (or files) in 
> the source which do the crawling please? I'm not professing to be a top 
> Java programmer (perl's my speciality) but I'll give it a shot, unless 
> anyone else wants to?!

The quick hack would be to add the necessary code somewhere in the 
protocol-httpclient plugin. Eventually, though, I think Nutch should grow 
an authentication factory that supplies the needed credentials to other 
plugins.
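
To make the idea concrete, here is a rough sketch of what such an authentication factory might look like. It is only an illustration under stated assumptions: the class AuthenticationFactory and its methods register, lookup and basicAuthHeader are hypothetical names and not part of Nutch or protocol-httpclient, it covers only HTTP Basic authentication, and it is written in current Java rather than against any existing Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical sketch of an "authentication factory": a single place that
 * holds credentials per host so protocol plugins can ask for them.
 * None of these names exist in Nutch; they are assumptions for illustration.
 */
final class AuthenticationFactory {

    /** A username/password pair for one protected host. */
    static final class Credentials {
        final String user;
        final String password;

        Credentials(String user, String password) {
            this.user = user;
            this.password = password;
        }
    }

    // Credentials keyed by lower-cased host name.
    private final Map<String, Credentials> byHost = new HashMap<>();

    /** Registers credentials for a host (host names are case-insensitive). */
    void register(String host, String user, String password) {
        byHost.put(host.toLowerCase(Locale.ROOT), new Credentials(user, password));
    }

    /** Returns the credentials for a host, or empty if none are registered. */
    Optional<Credentials> lookup(String host) {
        return Optional.ofNullable(byHost.get(host.toLowerCase(Locale.ROOT)));
    }

    /**
     * Builds the value of an HTTP Basic "Authorization" header for a host:
     * "Basic " followed by base64("user:password").
     */
    Optional<String> basicAuthHeader(String host) {
        return lookup(host).map(c -> {
            byte[] raw = (c.user + ":" + c.password).getBytes(StandardCharsets.UTF_8);
            return "Basic " + Base64.getEncoder().encodeToString(raw);
        });
    }

    public static void main(String[] args) {
        AuthenticationFactory factory = new AuthenticationFactory();
        // Placeholder host and credentials, not taken from any real site.
        factory.register("intranet.example.com", "crawler", "secret");

        // A plugin would attach this header to its request, if present.
        factory.basicAuthHeader("intranet.example.com")
               .ifPresent(h -> System.out.println("Authorization: " + h));
    }
}
```

In this sketch a fetcher plugin would call basicAuthHeader with the target host before sending a request and add the header only when a value comes back, so unprotected sites are fetched exactly as before. Keeping the lookup in one class is what would let other plugins share the same credentials.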

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
