nutch-dev mailing list archives

From Alparslan Avcı (JIRA) <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4
Date Thu, 01 May 2014 10:24:17 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986479#comment-13986479 ]

Alparslan Avcı commented on NUTCH-1714:
---------------------------------------

Hi [~jnioche],

Thanks for the reviews and tests. Regarding the issues you raised:
bq. There is no progression of the complete status of mappers : they go from 0% to 100% for
the tasks taking the input from GORA i.e not the injection
Like [~lewismc], I do not know what causes this. I will look into it as well.
bq. The whole content of the webtable seems to be taken as input for mapreduce. I assumed
it wouldn't be the case for GORA-119 and that the fetch step for instance would get only the
entries marked by the Generator. There is NUTCH-1674 but this should only add the batchID
to the filters according to its title.
This [patch|https://issues.apache.org/jira/secure/attachment/12642309/NUTCH-1714v4.patch]
only contains the updates needed to use gora-0.4 in Nutch, and NUTCH-1674 only contains fixes
for the batchId filters. As I said in my earlier comment:
bq. In the patch I added, I applied the possible filters (which are only batchId filters for
now) to the jobs. After the implementation of new HBase filters and filterset on Gora, we
can add new filters (e.g. a filter on the non-existence of Mark.FETCH_MARK for FetcherJob) and
remove some of the checks from the map functions.
We can open another issue to implement the other filters for Nutch.
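
To make the distinction concrete, here is a minimal, self-contained sketch of the idea in plain Python. It is not Gora or Nutch code: the Row class, the marker keys "generate_mark" and "fetch_mark", and all function names are hypothetical and only model the behaviour discussed above. With only a batchId filter, rows that were already fetched still reach the map function and must be checked there; adding a "mark is missing" filter would let the scan drop them first.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Row:
    """Hypothetical stand-in for one webtable row (not the Nutch WebPage class)."""
    url: str
    markers: Dict[str, str] = field(default_factory=dict)


RowFilter = Callable[[Row], bool]


def batch_id_filter(batch_id: str) -> RowFilter:
    """Accept rows whose generate mark equals batch_id; rows without the mark are dropped."""
    return lambda row: row.markers.get("generate_mark") == batch_id


def mark_missing_filter(mark: str) -> RowFilter:
    """Accept rows that do NOT carry the given mark (e.g. not fetched yet)."""
    return lambda row: mark not in row.markers


def scan(table: Iterable[Row], filters: List[RowFilter]) -> Iterator[Row]:
    """Model a store-side scan: only rows passing every filter are handed to the job."""
    for row in table:
        if all(accept(row) for accept in filters):
            yield row


# Small sample table, invented for illustration.
WEBTABLE = [
    Row("a", {"generate_mark": "batch-1"}),
    Row("b", {"generate_mark": "batch-1", "fetch_mark": "batch-1"}),
    Row("c", {"generate_mark": "batch-2"}),
    Row("d"),
]

# batchId filter only: the already-fetched row "b" still reaches the mapper.
batch_only = [r.url for r in scan(WEBTABLE, [batch_id_filter("batch-1")])]

# batchId filter plus a "fetch mark is missing" filter: the mapper needs no extra check.
batch_unfetched = [
    r.url
    for r in scan(WEBTABLE, [batch_id_filter("batch-1"), mark_missing_filter("fetch_mark")])
]
```

In this model an empty filter list hands the whole table to the job, which corresponds to the behaviour reported in the quoted remark.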
bq. ./nutch readdb -crawlId MYCRAWLIDHERE -stats gets 0 docs but I can see the corresponding
table in HBase.
I will try this command as well, look for the cause, and share the results with you.

> Nutch 2.x upgrade to Gora 0.4
> -----------------------------
>
>                 Key: NUTCH-1714
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1714
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Alparslan Avcı
>            Assignee: Alparslan Avcı
>             Fix For: 2.3
>
>         Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, NUTCH-1714v2.patch, NUTCH-1714v4.patch
>
>
> The Nutch upgrade for the GORA_94 branch has to be implemented. We can discuss the details in this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)