nutch-dev mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: Why are web urls not assumed to be http
Date Sat, 26 Apr 2014 21:17:06 GMT
Hi Diaa,

> Why doesn't nutch assume that web links that have www. at the beginning are
> of the http protocol?

It would not be a big problem to do so. The URL normalizer provides scopes
(inject, fetch, etc.): you only need to point the property
"urlnormalizer.regex.file.inject" to a dedicated regex-normalize-inject.xml
(or any other filename of your choice). In that file you can define rules
like the one described.
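For illustration, a minimal sketch of what such a scoped rule file could look
like (the file name regex-normalize-inject.xml and the pattern below are only
examples, not part of the default Nutch configuration):

  <?xml version="1.0"?>
  <!-- regex-normalize-inject.xml: example rule that prepends "http://"
       to seed URLs that start with "www." (illustrative only) -->
  <regex-normalize>
    <regex>
      <pattern>^www\.</pattern>
      <substitution>http://www.</substitution>
    </regex>
  </regex-normalize>

and in nutch-site.xml point the inject scope at that file:

  <property>
    <name>urlnormalizer.regex.file.inject</name>
    <value>regex-normalize-inject.xml</value>
  </property>

With that in place the injector would normalize such seeds before they reach
the URL parsing that currently rejects them.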

Why are there no such specific rules for the injector?
- maybe just because no one has done it or wants to maintain the rule set
  (defining a commonly accepted set of rules isn't easy:
   you could go on forever, e.g. what about also adding www. when it's missing?)
- seeds are fully controlled by the crawl administrators, so it's
  comparatively simple to teach them to use fully specified URLs.
  Much simpler than explaining the usage of URL filters.

Sebastian

On 04/25/2014 11:53 AM, Diaa Abdallah wrote:
> Hi,
> I tried injecting www.google.com into my crawldb without prepending
> http:// to it.
> It injected fine; however, when I ran generate on it, it gave the
> following warning:
> "Malformed URL: 'www.google.com', skipping (java.net.MalformedURLException:
> no protocol: www.google.com"
> 
> Why doesn't nutch assume that web links that have www. at the beginning are
> of the http protocol?
> 
> Thanks,
> Diaa
> 

