nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch
Date Wed, 30 Apr 2014 12:57:20 GMT


Julien Nioche commented on NUTCH-1714:

Hi [~alparslan.avci]

I have been trying your patch and found several issues. They might not be directly caused
by it but could be related to Gora 0.4. BTW can I suggest a change of title for this issue
to "Upgrade to Gora 0.4" now that it has been released? I am running a crawl on 1.3M URLs
in pseudo-distributed mode with HBase.

* The mappers report no intermediate progress: tasks that take their input from GORA
(i.e. everything except the injection) jump straight from 0% to 100%.
* The whole content of the webtable seems to be taken as input for mapreduce. I assumed this
would no longer be the case with [GORA-119] and that the fetch step, for instance, would get
only the entries marked by the Generator. There is also [NUTCH-1674] but, judging by its
title, it should only add the batchID to the filters.
* ./nutch readdb -crawlId MYCRAWLIDHERE -stats reports 0 docs, although I can see the
corresponding table in HBase.
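
To make the second point concrete, here is a minimal sketch in plain Python of the two behaviours being contrasted. It is only a model: the Row class, the batch_id field and both functions are hypothetical names, and nothing here shows how Gora 0.4, HBase or Nutch actually implement filtering. The first function hands every row to the mapper and discards the unmarked ones there (what the crawl appears to do); the second applies the batch filter before the job sees the rows (what I expected).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    url: str
    batch_id: Optional[str]  # hypothetical stand-in for the mark set by the Generator


def full_scan(table: List[Row], batch_id: str) -> Tuple[int, List[str]]:
    """Hand every row to the mapper, which discards the unmarked ones."""
    delivered = 0
    fetched = []
    for row in table:  # the whole table is the job input
        delivered += 1
        if row.batch_id == batch_id:
            fetched.append(row.url)
    return delivered, fetched


def filtered_scan(table: List[Row], batch_id: str) -> Tuple[int, List[str]]:
    """Hand only the rows carrying the batch mark to the mapper."""
    # The filter is applied before the rows become job input.
    job_input = [row for row in table if row.batch_id == batch_id]
    fetched = [row.url for row in job_input]
    return len(job_input), fetched
```

Both functions fetch the same URLs; they differ only in how many rows count as input to the job, which is the number that matters on a 1.3M URL table.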

Thanks! Julien


> Nutch 2.x upgrade to use GORA_94 branch
> ---------------------------------------
>                 Key: NUTCH-1714
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Alparslan Avcı
>            Assignee: Alparslan Avcı
>         Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, NUTCH-1714v2.patch,
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the details in
> this issue.

This message was sent by Atlassian JIRA
