nutch-dev mailing list archives

From EM <>
Subject Re: Nutch Crawler, Page Redirection and Pagination
Date Mon, 26 Sep 2005 04:19:11 GMT

>>I know that if you are a big user (several dedicated machines in a data
>>center with a fast connection...) you probably don't care about this; your
>>crawler will run over any website with 50-500 threads and the default three
>>retries, and the problem will solve itself. But can something
>>be done for the rest of us, please?
>No, I don't think so. Some web designers put up the "url director" as an
>obstacle to search engines. It is common in China, and you cannot get
>the content of these websites at all.
Maybe I wasn't totally clear: with a 10 second timeout, the fetcher
will skip over a bunch of pages on the same host. Any obstacles will be
pretty much ignored, because those pages won't be fetched and the pages
they link to won't be fetched either. At large scale, search engine
traps or not, the fetcher will play rough and get over them in 3 runs
(actually a bit more, since some pages will be fetched). This is of
course only the case if you don't need 100% of the pages, just as many
as you can fetch.
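
For reference, the knobs behind this behavior live in the Nutch
configuration. A minimal sketch below, assuming the standard property
names from nutch-default.xml; the values are illustrative, not
recommendations:

    <!-- nutch-site.xml overrides; values chosen only to illustrate -->
    <property>
      <name>http.timeout</name>
      <value>30000</value>    <!-- per-request timeout in ms; the 10s above is the 10000 default -->
    </property>
    <property>
      <name>fetcher.threads.fetch</name>
      <value>50</value>       <!-- number of fetcher threads -->
    </property>
    <property>
      <name>db.fetch.retry.max</name>
      <value>3</value>        <!-- the "three retries" quoted above before a page is given up on -->
    </property>

Raising http.timeout is the small-user workaround: slow hosts stop
timing out, at the cost of slower runs.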

People who are technically able to set up search engine traps should be
technically able to put up a robots.txt. Of course, with both sides not
obeying the rules, it's been a bit of a mess lately and everyone is
paying the price.

I've encountered cases where spam was the issue, not search engine
traps. There's a website with mod_rewrite or something like it set up
so that ANY RANDOM link you can type under its domain is valid and
shows you a bunch of unrelated random advertisements. These are static
pages, by the way. Now, if I had 100mbps my fetcher would run over that
website without blinking; being limited to 2, the effect is noticeable.
No matter how many times I ran the fetcher, the number of pages left to
fetch wasn't decreasing ;) I've run into cases like this, and instead
of manually typing regexes to clean them off (which takes time) I'd
strongly prefer an automated solution if possible.
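
For anyone hitting the same thing, the manual cleanup I mean is a regex
URL filter. A minimal sketch, assuming Nutch's standard
conf/regex-urlfilter.txt mechanism (rules are tried in order, first
match wins); the host name is hypothetical:

    # conf/regex-urlfilter.txt sketch; spam-ads.example.com stands in
    # for the link-farm host
    # drop every URL from that host before it reaches the fetch list
    -^http://spam-ads\.example\.com/
    # keep accepting everything else (the usual catch-all last rule)
    +.

It works, but you only find out which hosts to block after a run has
already wasted fetch slots on them, which is why an automated detector
would be nicer.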

