nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat
Date Wed, 05 Sep 2007 15:07:35 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525124 ]

Doğacan Güney commented on NUTCH-548:
-------------------------------------

> Maybe I missed something, but it seems we do it.
>
> CrawlDb.update defines a JobConf that uses CrawlDbFilter as the Mapper class and sets
> urlnormalizer and filter. The urlnormalizer and filter flags are passed through the
> configuration (I suppose that happens when we set the plugin).
>
> Actually, I found this out while I was testing/debugging this patch. You can see it by
> running a simple crawl in debug mode in Eclipse and setting a breakpoint on
> RegexURLNormalizer.regexNormalizer.

OK, so I added this simple patch ( http://www.ceng.metu.edu.tr/~e1345172/print.patch ), and
updatedb doesn't print anything unless I pass "-filter" or "-normalize" on the command line.
So I don't think we do it unless the user asks for it.
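
To make the observed behavior concrete, here is a minimal sketch of flag-gated processing. It is not Nutch code: the class, the toy normalizer, and the toy filter are all hypothetical, and only the "-filter" and "-normalize" flag names come from the message above. It assumes each flag simply switches its step on.

```java
import java.util.Arrays;
import java.util.Locale;

// Illustrative sketch only; none of these class or method names come from Nutch.
final class UpdateDbFlagsSketch {

    /** Returns true when the given flag appears among the command-line arguments. */
    static boolean hasFlag(String[] args, String flag) {
        return Arrays.asList(args).contains(flag);
    }

    /** Hypothetical normalizer: lower-cases the whole URL string. */
    static String normalize(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    /** Hypothetical filter: accepts only URLs that start with "http://". */
    static boolean accept(String url) {
        return url.startsWith("http://");
    }

    /**
     * Processes one URL so that each step runs only if its flag was given.
     * Returns null when the URL is rejected by the filter.
     */
    static String process(String url, String[] args) {
        String result = url;
        if (hasFlag(args, "-normalize")) {
            result = normalize(result);
        }
        if (hasFlag(args, "-filter") && !accept(result)) {
            return null;
        }
        return result;
    }
}
```

With no flags, `process` returns the URL untouched, which matches a print patch staying silent in that case.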

> You point out an interesting thing. Why do we have a scope? I tried to check the code and
> it seems we never really use the scope defined in the function. Am I wrong?

Scope is just an extra piece of information that may be used by plugins. A URL normalizer
plugin may want to treat a URL differently during an invertlinks operation, an updatedb
operation, or whatever. I think it is not used by any plugins right now, but it doesn't hurt
to keep it and it is potentially useful (btw, there is an ongoing issue to add scope to URL
filters too).
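
As a rough illustration of what a scope-aware normalizer could look like, here is a minimal sketch. It is not a real Nutch plugin: the class, the scope names, and the rules are all made up for the example. The only point is that the same URL can come out differently depending on which operation is calling.

```java
// Illustrative sketch only; the class, scope names, and rules are hypothetical.
final class ScopeSketch {

    // Hypothetical scope names, chosen only for this illustration.
    static final String SCOPE_DEFAULT = "default";
    static final String SCOPE_INVERTLINKS = "invertlinks";

    /** Normalizes a URL string; the scope says which operation is calling. */
    static String normalize(String url, String scope) {
        // Rule applied in every scope: drop the fragment ("#...").
        int hash = url.indexOf('#');
        String result = hash >= 0 ? url.substring(0, hash) : url;

        // Scope-specific rule (made up): when inverting links, also drop the query string.
        if (SCOPE_INVERTLINKS.equals(scope)) {
            int question = result.indexOf('?');
            if (question >= 0) {
                result = result.substring(0, question);
            }
        }
        return result;
    }
}
```

A normalizer that ignores its scope argument behaves like the default branch here, which is why keeping the parameter costs nothing for plugins that do not need it.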

> Besides, if you look at the following code in regexNormalize: [...]

I haven't looked at the urlnormalizer-regex code in detail, so I am not sure about this, but
at first glance the setting-EMPTY_RULES-getting-default-rules part seems unnecessary.

> Move URLNormalizer from Outlink to ParseOutputFormat
> ----------------------------------------------------
>
>                 Key: NUTCH-548
>                 URL: https://issues.apache.org/jira/browse/NUTCH-548
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-548.patch
>
>
> The idea is to avoid instantiating a new URLNormalizer for every Outlink.
> So I moved this operation to the ParseOutputFormat object.
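
The idea in the issue description can be sketched as follows. This is not the attached patch and does not use Nutch classes: `CountingNormalizer` and `OutputWriter` are hypothetical stand-ins, and the counter exists only to show how many normalizers each approach builds.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; these classes are hypothetical stand-ins, not Nutch classes.
final class SharedNormalizerSketch {

    /** Stand-in for a normalizer that is costly to build; counts how often it is constructed. */
    static final class CountingNormalizer {
        static int instances = 0;

        CountingNormalizer() {
            instances++;
        }

        String normalize(String url) {
            return url.trim();
        }
    }

    /** Old shape: every link builds its own normalizer. */
    static List<String> normalizePerLink(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            out.add(new CountingNormalizer().normalize(url));
        }
        return out;
    }

    /** New shape: the writer owns one normalizer and reuses it for every link. */
    static final class OutputWriter {
        private final CountingNormalizer normalizer = new CountingNormalizer();

        List<String> write(List<String> urls) {
            List<String> out = new ArrayList<>();
            for (String url : urls) {
                out.add(normalizer.normalize(url));
            }
            return out;
        }
    }
}
```

For three links, the per-link version constructs three normalizers while the writer constructs one, and both produce the same output.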

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

