nutch-dev mailing list archives

From "Doğacan Güney" <>
Subject Re: Scoring API issues (LONG)
Date Tue, 18 Sep 2007 19:40:23 GMT

I think the ideas here are brilliant. A big +1 from me. I have one
minor suggestion that I detail below.

On 9/13/07, Andrzej Bialecki <> wrote:
> Hi all,
>
> I've been working recently on a custom scoring plugin, and I ran into
> some issues with the scoring API that severely limit the way we can
> calculate static page scores. I'd like to restart the discussion about
> this API, and propose some changes. Any comments or suggestions are welcome!
>
> Currently we use the ScoringFilters API in several places in the code,
> in order to adjust page scores during different operations (such as
> generating, parsing, updating the db, and indexing). The API can be
> divided into two distinct parts - the part that prioritizes pages to be
> re-fetched, and the part that calculates static page scores that convey
> the idea of query-independent page "importance" or "quality". Of these
> two, the second one is more complex and requires better support in the code.
>
> One general comment: all of the issues below could be handled by
> special-purpose tools that execute several map-reduce jobs to read the
> data, build intermediate databases, and update crawldb / linkdb as
> necessary. However, it would be good to investigate if we can come up
> with an abstraction that is flexible enough to do this for different
> scoring algorithms, and still doesn't require running many additional jobs.
>
> 1. Partial link graph issue
> ---------------------------
>
> Historically speaking, this API was abstracted from the existing Nutch
> code, which implemented the OPIC-like scoring. As such, it works on the
> premise that it's possible to correctly calculate a static page score
> given only the following information:
> * page history recorded in CrawlDb,
> * new page information from the current segment,
> * partial inlink information available in the segments currently being
> processed (in updatedb).
>
> This works well for OPIC (if even then - the jury's still out ;) ),
> but not for many other scoring algorithms, including the most popular
> ones like PageRank or HITS, or even the inlink-degree algorithm from
> Nutch 0.7. Those algorithms require processing of the complete web graph
> information (i.e. the inlink / outlink info) that's been collected so far.
>
> I propose the following change: let's update the LinkDb in the same step
> as CrawlDb, so that both are up-to-date with the most current
> information after finishing the updatedb step. In terms of the input /
> output data involved, this change does increase the amount of data to be
> processed, by the size of LinkDb. However, the benefit of this change is
> that for each page to be updated we have access to its complete inlink
> information, which enables us to use other scoring algorithms that
> require this data. Also, we don't have to run invertlinks anymore.
>
> So, the updatedb process would look like this:
>         - CrawlDb: <Text url, CrawlDatum dbDatum>
>         - LinkDb:  <Text url, Inlinks inlinks>
>         - segment/crawl_fetch:
>                   <Text url, CrawlDatum fetchDatum>
>         - segment/crawl_parse:
>                   <Text url, CrawlDatum inlink>
>
> MAP: simply collects the input data in records that are able to
> accommodate all this information:
>         <Text url, <Inlinks oldInlinks, List newInlinks,
>                 CrawlDatum dbDatum, CrawlDatum fetchDatum> >
>
> REDUCE: uses a modified version of CrawlDbReducer, which first collapses
> all incoming records to a single record in the above format, i.e. it
> collects all incoming records and fills in the slots for dbDatum,
> fetchDatum, oldInlinks and newInlinks. Then we pretty much reuse the rest
> of the existing logic in CrawlDbReducer - but at the end of the reduce()
> we can pass the full inlink information to ScoringFilters.updateDbScore.
> Finally, we aggregate all inlink information and output the following
> record:
>         <Text url, <Inlinks newInlinks, CrawlDatum newDbDatum> >
>
> OUTPUT: we use a special OutputFormat that splits output records into
> <url, newInlinks> and <url, newDbDatum> and creates new versions of both
> CrawlDb and LinkDb.
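
To make this concrete, the combined record could be a small Writable along
these lines (just a sketch - the class name and layout are made up, nothing
like this exists in Nutch yet):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

// Sketch only - combined record for the merged updatedb job.
public class CrawlUpdateRecord implements Writable {

  public Inlinks oldInlinks;       // from LinkDb, may be null
  public List<CrawlDatum> newInlinks = new ArrayList<CrawlDatum>();
  public CrawlDatum dbDatum;       // from CrawlDb, may be null
  public CrawlDatum fetchDatum;    // from crawl_fetch, may be null

  public void write(DataOutput out) throws IOException {
    // presence flag before each optional slot
    out.writeBoolean(oldInlinks != null);
    if (oldInlinks != null) oldInlinks.write(out);
    out.writeInt(newInlinks.size());
    for (CrawlDatum d : newInlinks) d.write(out);
    out.writeBoolean(dbDatum != null);
    if (dbDatum != null) dbDatum.write(out);
    out.writeBoolean(fetchDatum != null);
    if (fetchDatum != null) fetchDatum.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    if (in.readBoolean()) {
      oldInlinks = new Inlinks();
      oldInlinks.readFields(in);
    } else {
      oldInlinks = null;
    }
    newInlinks.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      CrawlDatum d = new CrawlDatum();
      d.readFields(in);
      newInlinks.add(d);
    }
    if (in.readBoolean()) {
      dbDatum = new CrawlDatum();
      dbDatum.readFields(in);
    } else {
      dbDatum = null;
    }
    if (in.readBoolean()) {
      fetchDatum = new CrawlDatum();
      fetchDatum.readFields(in);
    } else {
      fetchDatum = null;
    }
  }
}

The special OutputFormat would then just write the two halves of each final
record to the new LinkDb and CrawlDb parts, respectively.
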
> 2. Lack of global properties
> ----------------------------
>
> Neither CrawlDb, nor LinkDb, nor segments keep around the most basic
> global statistics about them. Currently, in order to tell how many pages
> we have in the db it's necessary to run a mapred job (readdb -stats) -
> even though this information is static and could've been calculated in
> advance for a given generation of db-s or for each segment. This
> complicates even simple tasks such as readseg -list, and makes it
> difficult to keep around global score-related statistics for db-s and
> segments.
>
> So, I propose to add a metadata file located in a well-known location
> inside the db or segment directory. The file would be based on a single
> MapWritable that contains arbitrary key/value pairs, including predefined
> ones such as the number of records, last update time etc. We would need
> to maintain it for each db and each segment. Each operation that changes
> a db or a segment would update this information.
>
> In practical terms, I propose to add static methods to CrawlDbReader,
> LinkDbReader and SegmentReader, which can retrieve and / or update this
> information.
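
The helpers could be quite small, e.g. something like this (a sketch - the
file name "metadata" and the class/method names are invented):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;

// Sketch only - metadata access for a db or segment directory.
public class DbMetadata {

  private static final String METADATA_FILE = "metadata";

  /** Read dir/metadata, or return an empty map if it doesn't exist. */
  public static MapWritable read(Path dir, Configuration conf)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(dir, METADATA_FILE);
    MapWritable meta = new MapWritable();
    if (fs.exists(file)) {
      FSDataInputStream in = fs.open(file);
      try {
        meta.readFields(in);
      } finally {
        in.close();
      }
    }
    return meta;
  }

  /** Overwrite dir/metadata with the given map. */
  public static void write(Path dir, MapWritable meta, Configuration conf)
      throws IOException {
    FSDataOutputStream out = FileSystem.get(conf).create(
        new Path(dir, METADATA_FILE));
    try {
      meta.write(out);
    } finally {
      out.close();
    }
  }
}

Every operation that rewrites a db would read the old map, update the
predefined keys (record count, last update time, score totals), and write it
next to the new generation, so readseg -list becomes a simple file read.
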
> 3. Initialization of scoring plugins with global information
> ------------------------------------------------------------
>
> The current scoring API works only with local properties of the page (I'm
> not taking into account plugins that use external information sources -
> that's outside of the scope of the API). It doesn't have any built-in
> facilities to collect and calculate global properties useful for PR or
> HITS calculation, such as the number of dangling nodes (i.e. pages
> without outlinks), their total score, the number of inlinks, etc. It
> doesn't have the facility to output this collected global information at
> the end of the job. Nor does it have any facility to initialize scoring
> plugins with such information if it exists.
>
> I propose to add the following methods to scoring plugins, so that they
> can modify the job configuration right before the job is started; later
> on, the plugins can use this information when the scoring filters are
> initialized in each task. E.g.:
>
> public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration
> config);
> public void prepareGeneratorConfig(Path crawlDb, Configuration config);
> public void prepareIndexerConfig(Path crawlDb, Path linkDb, Path[]
> segments, Configuration config);
> public void prepareUpdateConfig(Path crawlDb, Path[] segments,
> Configuration config);

Should we really pass Path-s to these methods? IMHO, opening a file and
reading from it inside the plugin is a bit cumbersome. I would suggest
that the relevant job read the file itself and then pass the data (a
MapWritable) to the method. For example, prepareGeneratorConfig would
look like this:

public void prepareGeneratorConfig(MapWritable crawlDbMeta,
Configuration config);
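
A plugin would then just copy whatever it needs into the job configuration,
along these lines (a sketch - the metadata keys and property names below are
invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Sketch only - keys and property names are invented.
public class SampleScoringFilter /* implements ScoringFilter */ {

  public void prepareGeneratorConfig(MapWritable crawlDbMeta,
      Configuration config) {
    // Copy the values this plugin cares about into the job config,
    // so that setConf() in each task can pick them up later.
    Writable total = crawlDbMeta.get(new Text("score.dangling.total"));
    if (total != null) {
      config.set("scoring.sample.dangling.total", total.toString());
    }
    Writable count = crawlDbMeta.get(new Text("db.num.records"));
    if (count != null) {
      config.set("scoring.sample.num.records", count.toString());
    }
  }
}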

> Example: to properly implement the OPIC scoring, it's necessary to
> collect the total number of dangling nodes, and the total score from
> these nodes. Then, in the next step it's necessary to spread this total
> score evenly among all other nodes in the crawldb. Currently this is not
> possible unless we run additional jobs, and create additional files to
> keep this data around between the steps. It would be more convenient to
> keep this data in CrawlDb metadata (see above) and make relevant values
> available in the job context (Configuration).
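
Right - and with the metadata and the prepare hooks in place, this becomes
local to each task. Continuing the invented property names from the sketch
above, roughly:

// Continuing the SampleScoringFilter sketch (invented names).
private float danglingShare = 0.0f;

public void setConf(Configuration conf) {
  float totalDangling =
      conf.getFloat("scoring.sample.dangling.total", 0.0f);
  long numRecords = conf.getLong("scoring.sample.num.records", 1L);
  // Spread the score accumulated by dangling pages evenly over all
  // pages in the crawldb.
  danglingShare = totalDangling / (float) numRecords;
}

public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
    List<CrawlDatum> inlinked) {
  // Every page gets its equal share of the total dangling score.
  datum.setScore(datum.getScore() + danglingShare);
}
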
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>  Contact: info at sigram dot com

Doğacan Güney