nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
Date Mon, 29 Jan 2018 12:33:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343307#comment-16343307
] 

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE
into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164417155
 
 

 ##########
 File path: src/bin/crawl
 ##########
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   In [local mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation)
all reduce tasks run in a single JVM instance. The division could make sense only in [pseudo-distributed mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation),
provided that the remaining resources (e.g. the number of CPUs) allow all reduce tasks to run
in parallel. In distributed mode you want to define the maximum heap size based on the
configuration of your cluster nodes, because that determines how many tasks can run in
parallel on every node (in combination with other resource limits). The heap size configured
for a single task usually reflects what the task needs to run without hitting an
out-of-memory error. The YARN resource manager verifies that the heap size configured for
the job's tasks does not exceed the resource limits configured on the cluster nodes;
otherwise the job will fail.
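
   To make the distinction concrete, here is a minimal sketch of how the per-task heap
could depend on the run mode. The function name `child_heap_mb` and the mode labels `local`
and `pseudo` are hypothetical and not part of the patch or the crawl script. In distributed
mode the value should come from the cluster configuration instead, so the sketch does not
cover it.

```shell
# Hypothetical helper, not part of src/bin/crawl.
# Usage: child_heap_mb MODE TOTAL_HEAP_MB NUM_TASKS
#   MODE is an assumed label: "local" or "pseudo" (pseudo-distributed).
child_heap_mb() {
  chm_mode="$1"
  chm_total_mb="$2"
  chm_num_tasks="$3"

  if [ "$chm_mode" = "local" ]; then
    # Local mode: all tasks run in one JVM, so the heap is not split.
    echo "$chm_total_mb"
  elif [ "$chm_num_tasks" -gt 0 ]; then
    # Pseudo-distributed mode: tasks may run in parallel JVMs,
    # so each one gets an equal share (integer division).
    echo $(( chm_total_mb / chm_num_tasks ))
  else
    echo "NUM_TASKS must be a positive integer" >&2
    return 1
  fi
}
```

   For example, `child_heap_mb pseudo "$NUTCH_HEAP_MB" "$NUM_TASKS"` would reproduce the
division in the diff above, while `child_heap_mb local "$NUTCH_HEAP_MB" "$NUM_TASKS"` would
return the full heap unchanged.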

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2501
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2501
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Moreno Feltscher
>            Assignee: Lewis John McGibbney
>            Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
