nutch-dev mailing list archives

From Donghyeok Kang
Subject A link that begins with the question mark(?) can't be crawled.
Date Thu, 21 May 2009 13:51:22 GMT
Hi, all.
If the "href" attribute value of an A tag begins with a question mark (?) in
an HTML document, web browsers treat it as a query string and make no

But Nutch generates a malformed URL from it because of the java.net.URL class,
so it cannot crawl the right page.

Let's see the source code of

URL url = (base.toString().indexOf(';') > 0) ?
    fixEmbeddedParams(base, target) : new URL(base, target);
outlinks.add(new Outlink(url.toString(),
    linkText.toString().trim()));

And see,

       public URL(URL context, String spec)
       If the spec's path component begins with a slash character "/" then
       the path is treated as absolute and the spec path replaces the context
       path. Otherwise, the path is treated as a relative path and is appended
       to the context path, as described in RFC2396. Also, in this case, the
       path is canonicalized through the removal of directory changes made by
       occurrences of ".." and ".".

Because of this constructor, Nutch ends up with a malformed URL.

For example, if the base URL is "http://some.domain/dir/page?param=value"
and the target URL is "?param1=value1&param2=value2",
new URL(base, target) produces
"http://some.domain/dir/?param1=value1&param2=value2", not "

Nutch would then crawl the wrong URL.

I think the DOMContentUtils.getOutlinks() method should be modified.
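
One possible shape for such a change, as a rough sketch only: special-case
targets that start with "?" before falling back to new URL(base, target).
The class QueryOnlyLinkResolver and the methods resolve() and
resolveToString() are hypothetical names made up for this illustration and
are not part of Nutch. The sketch assumes that a query-only link should keep
the base path and replace only the query string, which is how browsers behave
as described above. It leaves out the existing fixEmbeddedParams() branch and
drops any user-info part of the base URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class QueryOnlyLinkResolver {

    /**
     * Hypothetical helper (not part of Nutch): resolves a link target against
     * a base URL, treating a target that starts with '?' as a replacement for
     * the base's query string only.
     */
    static URL resolve(URL base, String target) throws MalformedURLException {
        if (target.startsWith("?")) {
            // getPath() returns the base path without its query string, so the
            // last path segment ("page") is kept and only the query changes.
            return new URL(base.getProtocol(), base.getHost(), base.getPort(),
                    base.getPath() + target);
        }
        // Every other target goes through the usual two-argument constructor.
        return new URL(base, target);
    }

    /** Convenience wrapper for strings; rethrows malformed input unchecked. */
    static String resolveToString(String base, String target) {
        try {
            return resolve(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://some.domain/dir/page?param=value");
        String target = "?param1=value1&param2=value2";
        // Plain constructor, printed for comparison with the helper.
        System.out.println(new URL(base, target));
        System.out.println(resolve(base, target));
    }
}
```

Running main() prints the result of the plain constructor next to the result
of the helper, so the two can be compared on whatever JDK is in use.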

Thanks in advance.

- Donghyeok Kang
