nutch-dev mailing list archives

From "Marcos Bori (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2413) When fetching and parsing together, parameter "parse.filter.urls" is ignored
Date Fri, 25 Aug 2017 15:42:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141783#comment-16141783 ]

Marcos Bori commented on NUTCH-2413:
------------------------------------

Hi [~wastl-nagel],

Thanks for clarifying. You are right: in my code proposal I was applying "parse.filter.urls"
when the fetch result is a redirection.
However, I was also applying it where the filters run on the parsed outlinks: in class
FetcherThread, method output(), the URL filters are applied to the outlinks produced by
the parsing:

// Process all outlinks, normalize, filter and deduplicate
List<Outlink> outlinkList = new ArrayList<>(outlinksToStore);
HashSet<String> outlinks = new HashSet<>(outlinksToStore);
for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
	String toUrl = links[i].getToUrl();

	toUrl = ParseOutputFormat.filterNormalize(url.toString(), toUrl,
			origin, ignoreInternalLinks, ignoreExternalLinks, ignoreExternalLinksMode,
			urlFilters, urlExemptionFilters, normalizers);
	if (toUrl == null) {
		continue;
	}

	validCount++;
	links[i].setUrl(toUrl);
	outlinkList.add(links[i]);
	outlinks.add(toUrl);
}

To get equivalent behaviour whether we fetch and parse together or separately, and if I'm
not wrong, at this point we should avoid executing the filters if "parse.filter.urls" is
false (and the normalizers if "parse.normalize.urls" is false, as well).
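
A minimal sketch of that idea, not the actual patch. It uses only the JDK: the class
OutlinkGate is hypothetical, and a plain UnaryOperator and Predicate stand in for Nutch's
normalizers and URL filters. The two booleans are assumed to have been read from
"parse.normalize.urls" and "parse.filter.urls".

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical helper, not part of Nutch: gates normalizing and filtering on two flags. */
final class OutlinkGate {

  /**
   * Returns the deduplicated outlinks to keep. The normalizer and the filter
   * (both stand-ins) are skipped entirely when their flag is false.
   */
  static List<String> process(List<String> links,
      boolean normalizeUrls, boolean filterUrls,
      UnaryOperator<String> normalizer, Predicate<String> filter) {
    Set<String> kept = new LinkedHashSet<>(); // keeps order, drops duplicates
    for (String toUrl : links) {
      if (normalizeUrls) {
        toUrl = normalizer.apply(toUrl);
      }
      if (toUrl == null || (filterUrls && !filter.test(toUrl))) {
        continue; // missing URL, or rejected by normalizer or filter
      }
      kept.add(toUrl);
    }
    return new ArrayList<>(kept);
  }
}
```

With both flags false, every non-null link passes through untouched, which is the
behaviour I would expect from "parse.filter.urls" = false regardless of "fetcher.parse".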

In fact, ParseOutputFormat applies the filters at two points:
	(1) when pstatus.getMinorCode() is ParseStatus.SUCCESS_REDIRECT, they are applied to the
redirection URL
	(2) when the parse succeeds, they are applied to all outlinks
But case (2) is only executed if "fetcher.parse" is false (that is, when we are
executing fetch and parse separately), because when "fetcher.parse" is true, the filtering
is applied in FetcherThread::output(), as shown above.
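
To make the two code paths explicit, here is a small model of the behaviour as I
understand it. It is not Nutch code: the class OutlinkFilterSites, its enum and its method
names are hypothetical, and only the two property names come from the issue.

```java
/** Hypothetical model of the behaviour described in this issue; not Nutch code. */
final class OutlinkFilterSites {

  enum Site { FETCHER_THREAD_OUTPUT, PARSE_OUTPUT_FORMAT }

  /** Which component handles the outlinks, depending on "fetcher.parse". */
  static Site outlinkSite(boolean fetcherParse) {
    return fetcherParse ? Site.FETCHER_THREAD_OUTPUT : Site.PARSE_OUTPUT_FORMAT;
  }

  /** Reported behaviour: "parse.filter.urls" only matters in ParseOutputFormat. */
  static boolean filtersRunReported(boolean fetcherParse, boolean parseFilterUrls) {
    return outlinkSite(fetcherParse) == Site.FETCHER_THREAD_OUTPUT || parseFilterUrls;
  }

  /** Proposed behaviour: "parse.filter.urls" decides in both code paths. */
  static boolean filtersRunProposed(boolean fetcherParse, boolean parseFilterUrls) {
    return parseFilterUrls;
  }
}
```

The only combination where the two methods disagree is "fetcher.parse" = true with
"parse.filter.urls" = false, which is the case this issue is about.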

I'm posting a new pull request with the changes described above.

Am I right?
	

> When fetching and parsing together, parameter "parse.filter.urls" is ignored
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-2413
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2413
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, parser
>    Affects Versions: 1.13
>         Environment: Apache Nutch release 1.13.
>            Reporter: Marcos Bori
>             Fix For: 1.14
>
>
> Consider a situation where we want to:
> (1) Execute the fetch and parse together ("fetcher.parse" set to "true")
> (2) Avoid applying the URL filters when executing this phase.
> Condition (2) can be configured when parsing is executed as a separate process by setting
> "parse.filter.urls" to "false".
> However, this setting ("parse.filter.urls") is ignored when we execute the fetch and
> parse phases together.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
