nutch-dev mailing list archives

From "Doğacan Güney" <>
Subject Re: Limiting outlink tags.
Date Fri, 07 Sep 2007 07:55:45 GMT
Hi Marcin,

On 9/7/07, Marcin Okraszewski wrote:
> Hi,
> I have noticed that Nutch treats img/@src as an outlink. I suppose that in many
> cases people do not want to treat an image as an outlink; at least I don't. The
> same applies to script/@src. But it seems there is no way to limit outlink tags.
> DOMContentUtils.getOutlinks() takes links from all of a, area, form, frame,
> iframe, script, link and img. Only the "form" element can be turned off, via the
> "parser.html.form.use_action" parameter.
> I would suggest introducing a new configuration parameter that could be used to
> turn certain elements on or off. It could be done with a single parameter
> containing a comma-separated list of tags to be turned off.
> What is your opinion? If you think it is a valid issue, I can make a patch for this.

There is already NUTCH-488 open for this (with a patch). Feel free to
add comments/patches/etc. there. Btw, I agree that using a single
comma-separated list is better than adding a new configuration
parameter for every tag.
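
To make the comma-separated idea concrete, here is a rough sketch of how such a
value could be parsed and applied to the tag list mentioned above. It is only an
illustration under assumptions: the class and method names (OutlinkTagFilter,
parseIgnoredTags, enabledTags) are hypothetical, and this is neither the code in
DOMContentUtils nor the patch attached to NUTCH-488.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: hypothetical helper, not part of Nutch.
final class OutlinkTagFilter {

    // The tags that, according to this thread, are scanned for outlinks.
    static final List<String> LINK_TAGS = Arrays.asList(
        "a", "area", "form", "frame", "iframe", "script", "link", "img");

    // Parses a comma-separated value such as "img, script" into a set of
    // lowercase tag names. Blank entries and surrounding spaces are dropped.
    static Set<String> parseIgnoredTags(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return new LinkedHashSet<>();
        }
        return Arrays.stream(csv.split(","))
            .map(s -> s.trim().toLowerCase(Locale.ROOT))
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    // Returns the link tags that remain enabled once the ignored ones are removed.
    static List<String> enabledTags(String csv) {
        Set<String> ignored = parseIgnoredTags(csv);
        return LINK_TAGS.stream()
            .filter(tag -> !ignored.contains(tag))
            .collect(Collectors.toList());
    }
}
```

With a value of "img,script", enabledTags would keep a, area, form, frame, iframe
and link, so one parameter covers any combination of tags without a separate
switch per element.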

> Regards,
> Marcin

Doğacan Güney