nutch-dev mailing list archives

From "Ferdy Galema (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE
Date Fri, 07 Sep 2012 08:35:08 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450461#comment-13450461 ]

Ferdy Galema commented on NUTCH-872:
------------------------------------

Christian, I ran a test crawl with the Nutch 2.x branch and it seems to work correctly:

bin/nutch inject ~/urls/
bin/nutch generate
bin/nutch fetch -Dfetcher.parse=true -Dfetcher.store.content=false theBatchId

When I check HBase afterwards, the content family is empty for the fetched/parsed URLs, and
they are parsed correctly.
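
For anyone repeating the fetch step with a different batch id, a minimal sketch that only
prints the command used above may help. The NUTCH_BIN variable and the fetch_cmd function
are hypothetical names introduced here for illustration; the two -D properties are the ones
from the test crawl.

```shell
#!/bin/sh
# Sketch only: prints the fetch command from the test crawl; it does not run Nutch.
# NUTCH_BIN and fetch_cmd are hypothetical names, not part of Nutch itself.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# Print the fetch command for the batch id given as the first argument.
# fetcher.parse=true parses during the fetch; fetcher.store.content=false
# asks the fetcher not to store the raw content.
fetch_cmd() {
  batch_id="$1"
  printf '%s fetch -Dfetcher.parse=true -Dfetcher.store.content=false %s\n' \
    "$NUTCH_BIN" "$batch_id"
}

fetch_cmd theBatchId
```

Printing the command instead of executing it keeps the sketch safe to run outside a Nutch
checkout; pipe the output to sh only once the batch id has been checked.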

If your problem persists, please explain in detail how you run the crawl. (Also, it is
better to post this kind of question to the mailing list next time.)
                
> Change the default fetcher.parse to FALSE
> -----------------------------------------
>
>                 Key: NUTCH-872
>                 URL: https://issues.apache.org/jira/browse/NUTCH-872
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.2, 1.3, nutchgora
>            Reporter: Andrzej Bialecki 
>
> I propose to change this property to false. The reason is that it's a safer default -
> parsing issues don't lead to a loss of the downloaded content. For larger crawls this is the
> recommended way to run Fetcher. Users that run smaller crawls can still override it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
