nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1028) Log parser keys
Date Tue, 09 Aug 2011 12:07:27 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081585#comment-13081585 ]

Julien Nioche commented on NUTCH-1028:
--------------------------------------

In distributed mode you can follow the progress of the parsing on the Hadoop job tracker, which
also has a counter for the number of documents successfully parsed.
Of course you won't see that in local mode, but if you want to parse large segments then using
(pseudo-)distributed mode would be a good option anyway: you'd potentially have more
than one mapper or reducer at work and would make use of the multiple cores that your machine
almost certainly has, not to mention the benefits of replicated storage etc.
Your suggestion is good though, and it makes sense to have consistent behaviour across the
various jobs.

> Log parser keys
> ---------------
>
>                 Key: NUTCH-1028
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1028
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Trivial
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1028-1.4-1.patch
>
>
> The parser can take ages (many hours) to complete. During this time the only output is
> an error or warning if it's unable to parse something (which is very common). Sometimes the
> parser can run for several hours without any output: this is scary. I propose to add a LOG.info
> to the mapper and write the key when parsing, similar to the fetcher.
> Thoughts?
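
The proposal quoted above, together with the counter mentioned in the comment, could be sketched roughly as follows. This is only an illustration under stated assumptions: `ParseProgressSketch`, its `map` method, the message texts and the "empty content means unparsable" rule are all hypothetical stand-ins, not the actual Nutch parse mapper or the attached patch, and a plain `Consumer<String>` takes the place of the real logger so the example runs without Hadoop or Nutch on the classpath.

```java
import java.util.function.Consumer;

/**
 * Hypothetical stand-in for a parse mapper. It is NOT the Nutch code:
 * it only illustrates logging each key at info level while parsing
 * and keeping simple success/failure counts.
 */
class ParseProgressSketch {

    private final Consumer<String> log; // assumption: stands in for LOG.info / LOG.warn
    private long parsed = 0;            // loosely analogous to a "successfully parsed" counter
    private long failed = 0;

    ParseProgressSketch(Consumer<String> log) {
        this.log = log;
    }

    /** Called once per record; the key is the URL, the content is the raw page. */
    void map(String key, String content) {
        // The proposed change: write the key before parsing, as the fetcher does.
        log.accept("Parsing: " + key);

        // Assumption for the sketch: empty content counts as a parse failure.
        if (content == null || content.isEmpty()) {
            failed++;
            log.accept("Unable to parse: " + key);
            return;
        }
        parsed++;
    }

    long getParsed() {
        return parsed;
    }

    long getFailed() {
        return failed;
    }

    public static void main(String[] args) {
        ParseProgressSketch mapper = new ParseProgressSketch(System.out::println);
        mapper.map("http://example.org/a", "<html>a</html>");
        mapper.map("http://example.org/b", "");
        System.out.println("parsed=" + mapper.getParsed() + " failed=" + mapper.getFailed());
    }
}
```

With a per-key line like this, a long-running parse in local mode produces steady output even when nothing fails, while the counts give the same kind of summary that the job tracker counter provides in distributed mode.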

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
