nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2675) Give parsers the capability to read and write CrawlDatum
Date Mon, 19 Nov 2018 21:30:00 GMT


Sebastian Nagel commented on NUTCH-2675:

Yes, that could be done, but it would also require changing the interfaces of the parser and/or
ParseFilter plugins.

[~aquaticwater], did you consider implementing a scoring filter to do this job? Although the
ScoringFilter interface was originally designed to carry the score from the CrawlDb through the
fetch datum and the parsed page back to the CrawlDb (both via outlinks and via the CrawlDatum of
the fetched page), it can also be used to transfer metadata. The DepthScoringFilter
(src/plugin/scoring-depth/src/java/org/apache/nutch/scoring/depth/) is a good example of this
approach. It does not look straightforward at first glance, and you need to pass the information
along over multiple hops/methods, but it has the advantage that it works under any conditions.
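
To illustrate the "multiple hops" idea, here is a minimal sketch that models each hop as a copy between plain maps. It does not use the real Nutch classes: the maps stand in for the metadata of CrawlDatum, Content and ParseData, the method names only mirror the ScoringFilter hooks (their real signatures differ), and the key "my.custom.key" is a hypothetical name chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of passing one metadata value along the scoring-filter hops. */
public class MetadataHopSketch {
    // Hypothetical metadata key, used only in this illustration.
    static final String KEY = "my.custom.key";

    // Hop 1: fetch datum metadata -> content metadata (before parsing).
    static void passScoreBeforeParsing(Map<String, String> datumMeta,
                                       Map<String, String> contentMeta) {
        copy(datumMeta, contentMeta);
    }

    // Hop 2: content metadata -> parse metadata (after parsing).
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        copy(contentMeta, parseMeta);
    }

    // Copies the value stored under KEY, if present.
    private static void copy(Map<String, String> from, Map<String, String> to) {
        String value = from.get(KEY);
        if (value != null) {
            to.put(KEY, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> datumMeta = new HashMap<>();
        datumMeta.put(KEY, "news");
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();

        passScoreBeforeParsing(datumMeta, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);

        System.out.println(parseMeta.get(KEY)); // prints "news"
    }
}
```

In a real plugin, a further hop would carry the value from the parse metadata back to the CrawlDb when the CrawlDb is updated.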

> Give parsers the capability to read and write CrawlDatum
> --------------------------------------------------------
>                 Key: NUTCH-2675
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Junqiang Zhang
>            Priority: Minor
>             Fix For: 1.16
> Parsers are called inside org.apache.nutch.parse.ParseSegment,
> (Line 127 for version 1.15)        parseResult = parseUtil.parse(content);
> and inside org.apache.nutch.fetcher.FetcherThread.
> (Line 640 for version 1.15)            parseResult = this.parseUtil.parse(content);
> The current version of Nutch does not give parsers the capability to access CrawlDatum.
> If users want to customize the parsing process using some metadata of CrawlDatum, it is difficult
> to read the required metadata.
> On the other hand, if users want to save metadata generated during parsing, the metadata
> can only be saved as parseMeta of org.apache.nutch.parse.ParseData, and the parseMeta entries
> selected by a property in nutch-site.xml can then be added to CrawlDatum inside org.apache.nutch.parse.ParseOutputFormat
> and org.apache.nutch.crawl.CrawlDbReducer. If parsers had direct access to CrawlDatum, the
> metadata generated during parsing could be added to CrawlDatum directly by parsers.
> I use Nutch to fetch and parse web pages. To read the required metadata from CrawlDatum during
> parsing, I use the following workaround.
> (1) During web page fetching, inside org.apache.nutch.protocol.http.api.HttpBase of the lib-http
> plugin, read the required metadata from CrawlDatum, and save it together
> with the Headers metadata of the response to the metadata of org.apache.nutch.protocol.Content.
> This can be done at line 334 of the code by replacing "response.getHeaders()" with a new metadata object
> containing both the required metadata from CrawlDatum and the Headers metadata.
> The code that needs to be modified inside org.apache.nutch.protocol.http.api.HttpBase of the lib-http
> plugin is
> (Line 332 for version 1.15)      Content c = new Content(u.toString(), u.toString(),
> (Line 333 for version 1.15)           (content == null ? EMPTY_CONTENT : content),
> (Line 334 for version 1.15)           response.getHeader("Content-Type"), response.getHeaders(),
> (2) During HTML page parsing, inside org.apache.nutch.parse.html.HtmlParser of the parse-html
> plugin, read the required metadata from the metadata of org.apache.nutch.protocol.Content,
> and customize the parsing process using the required metadata.
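
The two workaround steps above can be sketched as follows. This is a simplified model, not the actual patch: plain maps stand in for the Nutch metadata objects, and the class name, the helper mergeMetadata and the key "my.custom.key" are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of merging CrawlDatum metadata into the Content metadata. */
public class ContentMetadataWorkaroundSketch {

    // Step (1): build the metadata handed to Content from the response
    // headers plus the selected CrawlDatum entries.
    static Map<String, String> mergeMetadata(Map<String, String> headers,
                                             Map<String, String> datumMeta,
                                             List<String> keysToCopy) {
        Map<String, String> merged = new HashMap<>(headers);
        for (String key : keysToCopy) {
            String value = datumMeta.get(key);
            if (value != null) {
                merged.put(key, value);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Type", "text/html");
        Map<String, String> datumMeta = new HashMap<>();
        datumMeta.put("my.custom.key", "news"); // hypothetical key

        Map<String, String> contentMeta =
            mergeMetadata(headers, datumMeta, Arrays.asList("my.custom.key"));

        // Step (2): the parser reads the value back from the content metadata.
        System.out.println(contentMeta.get("my.custom.key")); // prints "news"
    }
}
```

Copying only an explicit list of keys keeps unrelated CrawlDatum entries from overwriting response headers that happen to share a name.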
> If parsers had direct access to CrawlDatum, the above workaround would not be needed. To give
> parsers the capability to directly read and write CrawlDatum, I would like to suggest adding
> a new method "public ParseResult parse(Content content, CrawlDatum datum)" to org.apache.nutch.parse.ParseUtil
> in future versions of Nutch.
> To be compatible with the current 1.15 and previous versions, I would like to suggest adding
> a new configuration property to nutch-default.xml. By default, the property
> would select the current method "public ParseResult parse(Content content)". Users who want to
> use "public ParseResult parse(Content content, CrawlDatum datum)" can change the property
> in nutch-site.xml.
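
The proposal in the issue description could look roughly like the sketch below. It is only an illustration under stated assumptions: strings and maps stand in for Content, ParseResult and CrawlDatum, the property name "parser.pass.crawldatum" and the metadata key "parsed.length" are hypothetical, and the dispatch method is not part of the proposal as written.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of a parse overload that is enabled by a configuration property. */
public class ParseUtilSketch {
    // Hypothetical property name; it defaults to the existing behaviour.
    static final String USE_DATUM_PROPERTY = "parser.pass.crawldatum";

    private final boolean useDatum;

    ParseUtilSketch(Map<String, String> conf) {
        this.useDatum =
            Boolean.parseBoolean(conf.getOrDefault(USE_DATUM_PROPERTY, "false"));
    }

    // Existing style: the parser sees only the content.
    String parse(String content) {
        return "parsed:" + content;
    }

    // Proposed style: the parser can also read and write the datum metadata.
    String parse(String content, Map<String, String> datumMeta) {
        String result = parse(content);
        datumMeta.put("parsed.length", String.valueOf(content.length()));
        return result;
    }

    // Callers pick the overload according to the property.
    String dispatch(String content, Map<String, String> datumMeta) {
        return useDatum ? parse(content, datumMeta) : parse(content);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(USE_DATUM_PROPERTY, "true");
        Map<String, String> datumMeta = new HashMap<>();

        String result = new ParseUtilSketch(conf).dispatch("abc", datumMeta);
        System.out.println(result + " " + datumMeta.get("parsed.length")); // prints "parsed:abc 3"
    }
}
```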

This message was sent by Atlassian JIRA
