nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Closed: (NUTCH-378) MetaWrapper decorator
Date Tue, 14 Nov 2006 19:48:38 GMT
     [ http://issues.apache.org/jira/browse/NUTCH-378?page=all ]

Andrzej Bialecki  closed NUTCH-378.
-----------------------------------

    Fix Version/s: 0.9.0
       Resolution: Fixed

Added with modifications to trunk/, rev. 474934.

> MetaWrapper decorator
> ---------------------
>
>                 Key: NUTCH-378
>                 URL: http://issues.apache.org/jira/browse/NUTCH-378
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: MetaWrapper.java
>
>
> First, a bit of background.
> Currently some tools (Indexer, SegmentMerger, CrawlDbReducer) use ObjectWritable to pass
data from different parts of segment(s) to map-reduce methods. However, there is a high risk
that this data is processed incorrectly, because map-reduce methods no longer know the exact
source of any given data item.
> Example: Indexer may process many segments at the same time. In its reduce() method it
receives a set of values coming from different parts of the segment, but found at the same
key (url). However, if the same page is fetched multiple times, Indexer will receive multiple
sets of values from different segments (e.g. multiple fetchDatum, parseData, etc). It may
happen that some of the data items it picks up for further processing belong to one set,
and others to another, resulting in a final set that is a hodge-podge of partial
data coming from different segments. This could be avoided if each value had metadata to mark
it as belonging to a particular segment. Indexer could then collect all the complete
sets, and select the most recent one for further processing.
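
The selection step described above can be sketched roughly as follows. This is an illustration only, not the code from the attachment: the Tagged class, the REQUIRED part names, and the assumption that segment names are timestamp-style strings (so that sorting them orders segments by age) are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class SegmentSetSelector {

    /** Hypothetical stand-in for a wrapped value: its source segment and the part it represents. */
    public static final class Tagged {
        final String segment; // assumed timestamp-style name, e.g. "20061114194838"
        final String part;    // e.g. "fetchDatum", "parseData", "parseText"

        public Tagged(String segment, String part) {
            this.segment = segment;
            this.part = part;
        }
    }

    /** Parts that must all be present for a set to count as complete (an assumption). */
    static final Set<String> REQUIRED =
            new HashSet<>(Arrays.asList("fetchDatum", "parseData", "parseText"));

    /**
     * Groups the values seen for one URL by segment and returns the name of the
     * most recent segment that has a complete set, or null if no set is complete.
     */
    public static String selectSegment(List<Tagged> valuesForUrl) {
        // TreeMap keeps segment names sorted, so the last key is the newest.
        TreeMap<String, Set<String>> partsBySegment = new TreeMap<>();
        for (Tagged v : valuesForUrl) {
            partsBySegment.computeIfAbsent(v.segment, k -> new HashSet<>()).add(v.part);
        }
        // Walk from newest to oldest and stop at the first complete set.
        for (Map.Entry<String, Set<String>> e : partsBySegment.descendingMap().entrySet()) {
            if (e.getValue().containsAll(REQUIRED)) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Tagged> values = Arrays.asList(
                new Tagged("20061101000000", "fetchDatum"),
                new Tagged("20061101000000", "parseData"),
                new Tagged("20061101000000", "parseText"),
                new Tagged("20061114000000", "fetchDatum"),
                new Tagged("20061114000000", "parseData")); // newer set lacks parseText
        System.out.println(selectSegment(values));
    }
}
```

Because every value carries its segment name, parts from different fetches are never mixed: the newer but incomplete set is skipped as a whole rather than being padded with data from the older one.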
> A similar situation occurs in SegmentMerger, where data coming from different segments
is tagged with its source. However, the ParseText class doesn't support any metadata, so its text
has to be changed to contain the tag. This is unwieldy and far from elegant.
> A different problem occurs in CrawlDbReducer - we have instances of the same class, but
it's sometimes difficult to determine where they originally came from. This also limits us
to updating CrawlDb from one segment at a time; otherwise CrawlDatum instances from earlier segments
would be indistinguishable from those from newer segments. In short, the functionality and
internal logic here could be vastly improved if we knew where any CrawlDatum instance came
from.
> The attached class provides this functionality - instead of using ObjectWritable (or
plain CrawlDatum) we can wrap instances of input data in MetaWrapper, and add the metadata
needed to support the processing at hand. Then in map-reduce methods we can unpack the original
values, and use the additional metadata.
> Note: this wrapping/unwrapping is applied only during map-reduce jobs - data stored in
DBs and segments would remain the same.
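
The wrapping idea described above might look roughly like the sketch below. It is a simplified stand-in, not the attached MetaWrapper.java: the class name MetaWrapperSketch, its method names, and the "segment" metadata key are hypothetical, and the serialization a real map-reduce value would need is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical decorator: carries an arbitrary value plus string metadata about it. */
public class MetaWrapperSketch<T> {
    private final T value;
    private final Map<String, String> meta = new HashMap<>();

    public MetaWrapperSketch(T value) {
        this.value = value;
    }

    /** Attaches one metadata entry and returns this wrapper to allow chaining. */
    public MetaWrapperSketch<T> setMeta(String name, String val) {
        meta.put(name, val);
        return this;
    }

    /** Returns the metadata value for the given name, or null if it was never set. */
    public String getMeta(String name) {
        return meta.get(name);
    }

    /** Unwraps the original, unmodified value. */
    public T get() {
        return value;
    }

    public static void main(String[] args) {
        // Map side: tag the value with the segment it was read from.
        MetaWrapperSketch<String> wrapped =
                new MetaWrapperSketch<>("parse text of the page")
                        .setMeta("segment", "20061114194838");

        // Reduce side: read the tag, then unwrap the original value.
        System.out.println(wrapped.getMeta("segment") + " -> " + wrapped.get());
    }
}
```

The wrapped value itself is never altered, which is what removes the need to embed a tag inside ParseText's text.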

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
