nutch-dev mailing list archives

From "Tien Nguyen Manh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store
Date Fri, 03 Jan 2014 02:57:51 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142
] 

Tien Nguyen Manh commented on NUTCH-1686:
-----------------------------------------

In this patch I also fixed a bug with fetchTime. Currently, each time we run updatedb over the
whole table, fetchTime is increased again for all URLs.

> Optimize UpdateDb to load less field from Store
> -----------------------------------------------
>
>                 Key: NUTCH-1686
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1686
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3
>            Reporter: Tien Nguyen Manh
>             Fix For: 2.3
>
>         Attachments: NUTCH-1686.patch
>
>
> While running a large crawl I found that updatedb runs very slowly, especially the map task,
> which loads data from the store.
> We can't filter by batchId to load fewer URLs because of the bug in NUTCH-1679, so we must
> always update the whole table.
> After checking the fields loaded in UpdateDbJob I found that it loads many fields from the
> store (at least 15 of 25 fields), which makes updatedb slow.
> I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS and METADATA, which are
> used to compute the link score and distance. I think that is the main purpose of this job.
> The other fields are used to compute the URL schedule. We can move that code to Parser or
> Fetcher without loading many new fields, because many of these fields are generated by the
> parser. We can also use a Gora filter for Fetcher or Parser, so loading new fields is not a
> problem.
> I also added a new field, SCOREMETA, to WebPage to store CASH and DISTANCE, which are
> currently stored in METADATA. CASH is used in OPICScoring, which is used only in UpdateDb,
> and DISTANCE is used only in Generator and Updater, so moving both to a new metadata field
> avoids reading METADATA in Generator and Updater. METADATA contains a lot of data that is
> used only by Parser and Indexer.
> So with the new change:
> UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS and MARKERS; we don't need
> to load the big Fetch family and INLINKS.
> Generator only loads SCOREMETA (which is smaller than the current METADATA).
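
The field selection proposed above could be sketched roughly as follows. This is only an
illustration, not the contents of NUTCH-1686.patch: the Field enum is a reduced, hypothetical
stand-in for Nutch's WebPage fields (SCOREMETA is the field proposed in the issue), and the
class and method names are made up for the example.

```java
import java.util.EnumSet;

// Illustrative sketch only; names below are assumptions, not Nutch code.
class FieldSelectionSketch {

    // Hypothetical, reduced stand-in for the WebPage fields named in the issue.
    enum Field { FETCH_TIME, CONTENT, INLINKS, METADATA, SCORE, SCOREMETA, OUTLINKS, MARKERS }

    // Fields the issue says UpdateDb would load after the change.
    static final EnumSet<Field> UPDATE_DB_FIELDS =
            EnumSet.of(Field.SCORE, Field.SCOREMETA, Field.OUTLINKS, Field.MARKERS);

    // Fields a job does not request, and so does not have to read from the store.
    static EnumSet<Field> skipped(EnumSet<Field> loaded) {
        return EnumSet.complementOf(loaded);
    }

    public static void main(String[] args) {
        System.out.println("UpdateDb loads: " + UPDATE_DB_FIELDS);
        System.out.println("UpdateDb skips: " + skipped(UPDATE_DB_FIELDS));
    }
}
```

In a real job the chosen set would be passed to whatever call configures the store query; that
wiring is left out here because it depends on the Nutch and Gora versions in use.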



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
