nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1674) Use batchId filter enable scan (GORA-119) for Fetch,Parse,Update,Index
Date Tue, 26 Nov 2013 13:47:35 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832592#comment-13832592 ]

Lewis John McGibbney commented on NUTCH-1674:
---------------------------------------------

If you would like to consolidate the patch over on GORA-119, then we can discuss releasing
Gora. That would let us remove the unreliable SNAPSHOT dependencies in this patch and
test it more comprehensively before committing. Once we commit, we can consider releasing
Nutch 2.3. Right now this patch looks the part, and I am really looking forward to seeing what
more HBase users think. Thank you for investing time in this patch, Nguyen.

> Use batchId filter enable scan (GORA-119) for Fetch,Parse,Update,Index
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-1674
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1674
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3
>            Reporter: Nguyen Manh Tien
>             Fix For: 2.3
>
>         Attachments: NUTCH-1674.patch
>
>
> Nutch always scans the whole crawldb in each phase (generate, fetch, parse, update, index).
> When the crawldb is big, the scan takes longer than the actual processing.
> We really need to skip records while scanning, using GORA-119; for example, we could
> fetch only the records that belong to a specified batchId.
> In my crawl the filter reduced the scan time from 90 min to 30 min.
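
To illustrate the idea in the description above, here is a minimal sketch of the
difference between scanning everything and filtering in the job versus applying a
batchId filter during the scan. It is a toy in-memory model, not the Nutch or Gora API
and not the attached patch: the names BatchIdScanSketch, Row, ScanResult,
fullScanThenFilter, filteredScan and sampleTable are all hypothetical, and
"rowsTransferred" is only a stand-in for the cost of shipping rows from the store to
the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy model of a crawldb scan. Hypothetical names; not the Nutch or Gora API. */
final class BatchIdScanSketch {

    /** Hypothetical row: a URL key plus the batchId mark set when the row was generated. */
    static final class Row {
        final String url;
        final String batchId; // null if the row was never selected by generate
        Row(String url, String batchId) { this.url = url; this.batchId = batchId; }
    }

    /** Rows handed to the job, plus how many rows left the store to get there. */
    static final class ScanResult {
        final List<Row> rows;
        final int rowsTransferred;
        ScanResult(List<Row> rows, int rowsTransferred) {
            this.rows = rows;
            this.rowsTransferred = rowsTransferred;
        }
    }

    /** Full scan: every row is transferred, and the job discards the non-matching ones. */
    static ScanResult fullScanThenFilter(List<Row> table, String batchId) {
        List<Row> kept = new ArrayList<>();
        int transferred = 0;
        for (Row row : table) {
            transferred++; // the row reaches the job whether or not it matches
            if (batchId.equals(row.batchId)) {
                kept.add(row);
            }
        }
        return new ScanResult(kept, transferred);
    }

    /** Filtered scan: the predicate runs on the store side, so only matches are transferred. */
    static ScanResult filteredScan(List<Row> table, String batchId) {
        Predicate<Row> filter = row -> batchId.equals(row.batchId);
        List<Row> matched = new ArrayList<>();
        for (Row row : table) {
            if (filter.test(row)) {
                matched.add(row);
            }
        }
        return new ScanResult(matched, matched.size());
    }

    /** Small made-up table: six rows, two of which belong to "batch-2". */
    static List<Row> sampleTable() {
        List<Row> table = new ArrayList<>();
        table.add(new Row("http://a.example/", "batch-1"));
        table.add(new Row("http://b.example/", "batch-2"));
        table.add(new Row("http://c.example/", null));
        table.add(new Row("http://d.example/", "batch-1"));
        table.add(new Row("http://e.example/", "batch-2"));
        table.add(new Row("http://f.example/", null));
        return table;
    }

    public static void main(String[] args) {
        List<Row> table = sampleTable();
        ScanResult full = fullScanThenFilter(table, "batch-2");
        ScanResult filtered = filteredScan(table, "batch-2");
        System.out.println("full scan:     kept=" + full.rows.size()
                + " transferred=" + full.rowsTransferred);
        System.out.println("filtered scan: kept=" + filtered.rows.size()
                + " transferred=" + filtered.rowsTransferred);
    }
}
```

Both methods hand the same rows to the job; the difference in this model is only how
many rows are transferred to get there. How much time a real filter saves depends on
the backing store and on how selective the batchId is.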



--
This message was sent by Atlassian JIRA
(v6.1#6144)
