nutch-dev mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: [jira] [Commented] (NUTCH-2143) GeneratorJob ignores batch id passed as argument
Date Thu, 07 Jan 2016 19:20:36 GMT
Hi Lewis,

thanks! I'll commit the fix.

For 2.3.1, the following issues are also still open:
  NUTCH-2169 Integrate index-html into Nutch build
  NUTCH-2168 Parse-tika fails to retrieve parser
   - important: in effect, disables parse-tika for most document types
   - but has a drawback with index-solr when index-html is used to
     index the raw content of binary document types.
     I'll try to have a closer look at this problem right now.

I would like to commit these, too. Could you take the time to have a look at them?

Cheers,
Sebastian

On 01/07/2016 07:14 PM, Lewis John McGibbney (JIRA) wrote:
> 
>     [ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087804#comment-15087804 ]
> 
> Lewis John McGibbney commented on NUTCH-2143:
> ---------------------------------------------
> 
> Tested v3 and confirmed to fix the issue.
> I am +1 to committing and I will roll a release candidate and get 2.X back on track.
> Thank you both [~wastl-nagel] and [~liuqibj] nice work!
> 
>> GeneratorJob ignores batch id passed as argument
>> ------------------------------------------------
>>
>>                 Key: NUTCH-2143
>>                 URL: https://issues.apache.org/jira/browse/NUTCH-2143
>>             Project: Nutch
>>          Issue Type: Bug
>>          Components: generator
>>    Affects Versions: 2.3.1
>>            Reporter: Sebastian Nagel
>>            Assignee: Lewis John McGibbney
>>            Priority: Blocker
>>             Fix For: 2.3.1
>>
>>         Attachments: NUTCH-2143-v2.patch, NUTCH-2143-v3.patch, patch
>>
>>
>> The batch id passed to GeneratorJob by option/argument -batchId <id> is ignored
>> and a generated batch id is used to mark the current batch. Log snippets from a run of bin/crawl:
>> {noformat}
>> bin/nutch generate ... -batchId 1444941073-14208
>> ...
>> GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
>> Fetching : 
>> bin/nutch fetch ... 1444941073-14208 ...
>> ...
>> QueueFeeder finished: total 0 records. Hit by time limit :0
>> {noformat}
>> The generated URLs are marked with the wrong batch id:
>> {noformat}
>> hbase(main):010:0> scan 'test_webpage'
>> ROW                            COLUMN+CELL
>>  org.apache.nutch:http/        column=f:bid, timestamp=1444941077080, value=1444941074-858443668
>>  ...
>>  org.apache.nutch:http/        column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
>> {noformat}
>> and fetcher will not fetch anything. This problem was reported by Sherban Drulea
>> [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]], [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
> 
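
For illustration only, the behaviour the quoted issue asks for (use the batch id passed via -batchId, and generate one only when none is given) might be sketched as below. This is not the Nutch code and not the attached patch: the function name `resolve_batch_id` is hypothetical, and the "<epoch seconds>-<random int>" shape of the fallback id is an assumption modelled on the ids visible in the log snippets above.

```python
import random
import time
from typing import Optional


def resolve_batch_id(requested: Optional[str] = None) -> str:
    """Return the batch id that generated URLs should be marked with.

    Hypothetical helper for illustration; not part of Nutch.
    """
    if requested:
        # Honor the id supplied by the caller (e.g. via -batchId), so that
        # a later fetch step using the same id finds the marked URLs.
        return requested
    # No id supplied: fall back to a generated id. The format is an
    # assumption based on the ids shown in the quoted log output.
    return f"{int(time.time())}-{random.randint(0, 2**31 - 1)}"
```

With this logic, calling `resolve_batch_id("1444941073-14208")` returns the same string unchanged, so the generate and fetch steps would agree on the batch id. In the failing run quoted above, they did not.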

