nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2143) GeneratorJob ignores batch id passed as argument
Date Thu, 07 Jan 2016 18:14:39 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087804#comment-15087804 ]

Lewis John McGibbney commented on NUTCH-2143:
---------------------------------------------

Tested the v3 patch and confirmed that it fixes the issue.
I am +1 on committing it. I will then roll a release candidate and get 2.X back on track.
Thank you both, [~wastl-nagel] and [~liuqibj], nice work!

> GeneratorJob ignores batch id passed as argument
> ------------------------------------------------
>
>                 Key: NUTCH-2143
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2143
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3.1
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-2143-v2.patch, NUTCH-2143-v3.patch, patch
>
>
> The batch id passed to GeneratorJob by option/argument -batchId <id> is ignored
> and a generated batch id is used to mark the current batch. Log snippets from a run of bin/crawl:
> {noformat}
> bin/nutch generate ... -batchId 1444941073-14208
> ...
> GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
> Fetching : 
> bin/nutch fetch ... 1444941073-14208 ...
> ...
> QueueFeeder finished: total 0 records. Hit by time limit :0
> {noformat}
> The generated URLs are marked with the wrong batch id:
> {noformat}
> hbase(main):010:0> scan 'test_webpage'
> ROW                            COLUMN+CELL
>  org.apache.nutch:http/        column=f:bid, timestamp=1444941077080, value=1444941074-858443668
>  ...
>  org.apache.nutch:http/        column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
> {noformat}
> and the fetcher will not fetch anything. This problem was reported by Sherban Drulea
> [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]],
> [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].
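
The mismatch described in the issue can be illustrated with a small, self-contained Java sketch. This is not Nutch code and is not taken from the attached patches. The class BatchIdSketch and the methods resolveBatchId, generate and selectForFetch are hypothetical names invented for this illustration, and the in-memory map only stands in for the f:bid column shown in the HBase scan. It assumes the fetcher selects rows by exact batch id match, which is what the log output suggests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class BatchIdSketch {

    // Hypothetical helper: prefer the batch id supplied by the caller and
    // only fall back to a generated "<seconds>-<random>" id when none is given.
    static String resolveBatchId(String requested, long epochSeconds, Random random) {
        if (requested != null && !requested.isEmpty()) {
            return requested;
        }
        return epochSeconds + "-" + random.nextInt(Integer.MAX_VALUE);
    }

    // Stand-in for the generate step: mark every URL with the batch id.
    static Map<String, String> generate(List<String> urls, String batchId) {
        Map<String, String> marks = new LinkedHashMap<>();
        for (String url : urls) {
            marks.put(url, batchId);
        }
        return marks;
    }

    // Stand-in for the fetch step: select only rows marked with the given batch id.
    static List<String> selectForFetch(Map<String, String> marks, String batchId) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : marks.entrySet()) {
            if (entry.getValue().equals(batchId)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String requested = "1444941073-14208"; // id passed to both generate and fetch
        List<String> urls = Collections.singletonList("org.apache.nutch:http/");

        // Reported behaviour: the requested id is dropped (modelled here as null),
        // so rows are marked with a generated id that the fetch step never asks for.
        String ignored = resolveBatchId(null, 1444941074L, new Random());
        System.out.println("ignored:  " + selectForFetch(generate(urls, ignored), requested).size());

        // Expected behaviour: the requested id is used for marking,
        // so the fetch step finds the generated row.
        String honoured = resolveBatchId(requested, 1444941074L, new Random());
        System.out.println("honoured: " + selectForFetch(generate(urls, honoured), requested).size());
    }
}
```

Run as written, the first line reports 0 selected rows and the second reports 1, mirroring "QueueFeeder finished: total 0 records" in the log above.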



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
