nutch-dev mailing list archives

From Donghyeok Kang <wolfk...@gmail.com>
Subject A link that begins with the question mark(?) can't be crawled.
Date Thu, 21 May 2009 13:51:22 GMT
Hi, all.
If the "href" attribute value of an A tag in an HTML document begins with a
question mark (?), web browsers treat it as a query string relative to the
current page and resolve it without any problem.

But Nutch generates the wrong URL for such a link because of how the
java.net.URL class resolves it, so it cannot crawl the right page.

Let's see the source code of
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks.

    URL url = (base.toString().indexOf(';') > 0) ?
        fixEmbeddedParams(base, target) : new URL(base, target);
    outlinks.add(new Outlink(url.toString(),
        linkText.toString().trim()));

And see
http://java.sun.com/javase/6/docs/api/java/net/URL.html#URL(java.net.URL, java.lang.String)

       public URL(URL context, String spec)
       If the spec's path component begins with a slash character "/" then the
       path is treated as absolute and the spec path replaces the context path.
       Otherwise, the path is treated as a relative path and is appended to the
       context path, as described in RFC2396. Also, in this case, the path is
       canonicalized through the removal of directory changes made by
       occurrences of ".." and ".".

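A target that consists only of a query string has an empty path, so this
constructor treats it as relative and drops the last segment of the context
path. That is how Nutch ends up with the wrong URL.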

For example, if the base URL is "http://some.domain/dir/page?param=value"
and the target URL is "?param1=value1&param2=value2",
new URL(base, target) produces
"http://some.domain/dir/?param1=value1&param2=value2", not
"http://some.domain/dir/page?param1=value1&param2=value2".

Nutch then crawls the wrong URL.
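
The resolution described above can be reproduced outside Nutch with a few
lines of Java. This is only a minimal sketch: the class name
QueryOnlyLinkDemo and its resolve method are hypothetical names chosen for
illustration, and the result noted in the comment is the one reported in this
message.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; not part of Nutch.
final class QueryOnlyLinkDemo {

    // Resolves target against base with new URL(base, target), the same call
    // getOutlinks() makes when the base URL contains no ';'.
    static String resolve(String base, String target) throws MalformedURLException {
        return new URL(new URL(base), target).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        String base = "http://some.domain/dir/page?param=value";
        String target = "?param1=value1&param2=value2";
        // As reported in this message, the last path segment ("page") is
        // dropped, giving http://some.domain/dir/?param1=value1&param2=value2
        System.out.println(resolve(base, target));
    }
}
```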

I think the DOMContentUtils.getOutlinks() method should be modified to handle
targets that begin with a question mark.
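
One possible shape for such a change is sketched here, assuming it is enough
to put the last path segment of the base URL back in front of a query-only
target before resolving it. The class QueryOnlyResolver and its resolve
method are hypothetical and are not existing Nutch code; the ';' check and
the fixEmbeddedParams call from getOutlinks() are left out, and edge cases
are not considered.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper; a sketch of one possible workaround, not Nutch code.
final class QueryOnlyResolver {

    // Resolves target against base. A target that starts with '?' is first
    // prefixed with the last path segment of base, so that segment is kept.
    static URL resolve(URL base, String target) throws MalformedURLException {
        if (!target.startsWith("?")) {
            return new URL(base, target);
        }
        String path = base.getPath();
        // Text after the final '/' ("page" for "/dir/page", "" for "/dir/").
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        return new URL(base, lastSegment + target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://some.domain/dir/page?param=value");
        System.out.println(resolve(base, "?param1=value1&param2=value2"));
    }
}
```

With the base and target from the example above, this keeps "page" in the
resolved URL. Targets that do not start with a question mark go through
new URL(base, target) unchanged.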

Thanks in advance.

- Donghyeok Kang
