nutch-dev mailing list archives

From "Marcin Okraszewski (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-488) Avoid parsing uneccessary links and get a more relevant outlink list
Date Mon, 15 Oct 2007 20:26:50 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-488:
-------------------------------------

    Attachment: ignore_tags_v3.patch

OK, yet another approach based on Doğacan's comments. Sorry for the delay, but I didn't notice
the comment earlier.

- I didn't notice the conf.getStrings() method. Thanks for the hint :)
- I kept backward compatibility with the "use_action" param, but it works a bit differently
now when no value is set. The default is now to use forms, but if use_action is not specified,
forms can be dropped through the ignore_tags setting. If someone has set use_action to true
explicitly, then ignore_tags cannot override it. It is still a bit inconsistent,
but it is understandable that the specific setting (use_action) takes precedence. If the default
were "false", then someone who has neither defined "use_action" nor added form to ignore_tags
could expect forms to be taken, but they wouldn't be. Keeping backward compatibility makes
the code a bit clumsy :( ... and I think I've made it overly flexible, but that was the cleanest
solution here.
- For the repeating if: I agree it is error prone, but on the other hand it is easy to understand.
I didn't quite understand Doğacan's proposal :( but I think I did something acceptable: simply
remove all specified tags from the link params.
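
The precedence described above can be sketched roughly as follows. This is not the code from
ignore_tags_v3.patch, only an illustration of the rules as stated in this comment. The class
name, the plain Map standing in for the configuration, the shortened keys "use_action" and
"ignore_tags", and the tag-to-attribute table are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class IgnoreTagsSketch {

    // Hypothetical keys, shortened the way the comment above refers to them.
    static final String USE_ACTION = "use_action";
    static final String IGNORE_TAGS = "ignore_tags";

    // Parse the comma-separated ignore_tags value into lower-case tag names.
    static Set<String> ignoredTags(Map<String, String> conf) {
        Set<String> tags = new TreeSet<>();
        String raw = conf.get(IGNORE_TAGS);
        if (raw == null) {
            return tags;
        }
        for (String part : raw.split(",")) {
            String tag = part.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    // An explicit use_action always wins; otherwise forms are followed
    // unless "form" is listed in ignore_tags.
    static boolean followForms(Map<String, String> conf) {
        String explicit = conf.get(USE_ACTION);
        if (explicit != null) {
            return Boolean.parseBoolean(explicit.trim());
        }
        return !ignoredTags(conf).contains("form");
    }

    // Build an assumed tag -> attribute table, then drop every ignored tag
    // in one step instead of repeating an if per tag.
    static Map<String, String> linkParams(Map<String, String> conf) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("a", "href");
        params.put("area", "href");
        params.put("frame", "src");
        params.put("iframe", "src");
        params.put("script", "src");
        params.put("link", "href");
        params.put("img", "src");
        params.put("form", "action");

        params.keySet().removeAll(ignoredTags(conf));

        // Apply the form rule last so an explicit use_action cannot be overridden.
        if (followForms(conf)) {
            params.put("form", "action");
        } else {
            params.remove("form");
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(IGNORE_TAGS, "img, script, form");
        System.out.println(linkParams(conf).keySet());

        conf.put(USE_ACTION, "true");
        System.out.println(linkParams(conf).keySet());
    }
}
```

In this sketch an explicit use_action=false also drops forms even when form is not listed in
ignore_tags, which is how I read the backward-compatible case.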




> Avoid parsing uneccessary links and get a more relevant outlink list
> --------------------------------------------------------------------
>
>                 Key: NUTCH-488
>                 URL: https://issues.apache.org/jira/browse/NUTCH-488
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: Windows, Java 1.5
>            Reporter: Emmanuel Joke
>         Attachments: DOMContentUtils.patch, ignore_tags_v2.patch, ignore_tags_v3.patch,
nutch-default.xml.patch
>
>
> The NekoHTML parser uses a method to extract all outlinks from the HTML page. It extracts
them from the HTML content based on the list of params defined in the method setConf(). This
list of links is then truncated to the maximum number of outlinks processed per page, as
defined in nutch-default.xml (db.max.outlinks.per.page = 100 by default), and finally it goes
through all defined urlfilters.
> Unfortunately, the list of outlinks can contain more than 100 entries, in which case it is
truncated and some relevant links could be removed.
> So I've added a few options to nutch-default.xml to enable/disable the extraction
of links from specific HTML tags in this parser (SCRIPT, IMG, FORM, LINK).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

