nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "Nutch_1.X_RESTAPI/RunningJobsTutorial" by SujenShah
Date Wed, 20 May 2015 11:58:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Nutch_1.X_RESTAPI/RunningJobsTutorial" page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI/RunningJobsTutorial?action=diff&rev1=3&rev2=4

  {   
      "type":"INJECT",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", "url_dir":"url/"}
+     "crawlId":"crawl01",
+     "args": {"url_dir":"url/"}
  }
  }}}}
- The args contain two keys - crawldb, url_dir. Set them to appropriate values.
+ The args contain one key - url_dir. It should be the path of the URL directory where the seed file is stored.
  The response of the request is a JSON output
  {{{{
  {
     "confId":"default",
-    "args":{"crawldb":"crawl/crawldb","url_dir":"url/"},
-    "crawlId":null,
+    "args":{"url_dir":"url/"},
+    "crawlId":"crawl01",
     "msg":"OK",
     "id":"default-INJECT-635077497",
     "state":"RUNNING",
@@ -56, +57 @@

  {  
      "type":"GENERATE",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", "segments_dir":"crawl/segments"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - crawldb, segments_dir, force, topN, numFetchers, adddays, noFilter, noNorm, maxNumSegments. Set them to appropriate values.
+ The args contain keys - force, topN, numFetchers, adddays, noFilter, noNorm, maxNumSegments. Set them to appropriate values.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20generate|here]].
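
As a rough illustration of the revised request layout in this diff (a top-level crawlId, with only job-specific options left in args), the following Python sketch builds a GENERATE request body. The helper name `build_job_request` and the topN value are hypothetical. Option values are kept as strings to mirror the "noFilter":"true" style used on the page, which is an assumption about what the server accepts.

```python
import json


def build_job_request(job_type, crawl_id, args=None, conf_id="default"):
    """Return a job request body in the layout shown on the '+' lines.

    Hypothetical helper: paths such as crawldb are no longer passed in
    args; the request only names the crawl through crawlId.
    """
    return {
        "type": job_type,
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": dict(args or {}),  # copy so the caller's dict is not shared
    }


# topN is one of the keys listed for GENERATE; "100" is an illustrative value.
generate_request = build_job_request("GENERATE", "crawl01", {"topN": "100"})
print(json.dumps(generate_request, indent=4))
```

Calling the helper without args yields the empty `"args": {}` object used in the request examples of this revision.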
  
@@ -67, +69 @@

  {{{{
  {
      "confId":"default",
-     "args":{"crawldb":"crawl/crawldb","segments_dir":"crawl/segments"},
-     "crawlId":null,
+     "args":{},
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-GENERATE-274614034",
      "state":"RUNNING",
@@ -84, +86 @@

  {  
      "type":"FETCH",
      "confId":"default",
-     "args": {"segment":"crawl/segments/20150331153517"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - segment, threads, noParsing. Set them to appropriate values.
+ The args contain keys - threads, noParsing. Set them to appropriate values.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20fetch|here]].
  
@@ -95, +98 @@

  {{{{
  {
       "confId":"default",
-      "args":{"segment":"crawl/segments/20150331153517"},
+      "args":{},
-      "crawlId":null,
+      "crawlId":"crawl01",
       "msg":"idle",
       "id":"default-FETCH-99398319",
       "state":"IDLE",
@@ -112, +115 @@

  {  
      "type":"PARSE",
      "confId":"default",
-     "args": {"segment":"crawl/segments/20150331153517", "noFilter":"true"}
+     "crawlId":"crawl01",
+     "args": {"noFilter":"true"}
  }
  }}}}
- The args contain keys - segment, noFilter, noNormalize. Set them to appropriate values.
+ The args contain keys - noFilter, noNormalize. Set them to appropriate values.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20parse|here]].
  
@@ -123, +127 @@

  {{{{
  {
       "confId":"default",
-      "args":{"segment":"crawl/segments/20150331153517","noFilter":"true"},
+      "args":{"noFilter":"true"},
-      "crawlId":null,
+      "crawlId":"crawl01",
       "msg":"OK",
       "id":"default-PARSE-1413156163",
       "state":"IDLE",
@@ -140, +144 @@

  {  
      "type":"UPDATEDB",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", "segments":"crawl/segments/20150331153517"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - crawldb, segments, dir, force, normalize, filter, noAdditions. Set them to appropriate values.
+ The args contain keys - force, normalize, filter, noAdditions. Set them to appropriate values.
- 
- To use multiple segments, the segments parameter should contain the names of the segments separated by spaces. To specify an entire directory, use the dir parameter.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20updatedb|here]].
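
The change running through this diff is that path-style keys leave args and a crawlId is added at the top level. A minimal Python sketch of that rewrite follows. The function name `migrate_request` is hypothetical, and the set of dropped keys is read off the "-" lines of this message rather than taken from the Nutch source. The crawlId cannot be derived from the old paths, so the caller has to supply it.

```python
# Keys that appear in args on the "-" lines but not on the "+" lines.
# url_dir is deliberately absent: the INJECT request keeps it.
PATH_KEYS = {"crawldb", "segments_dir", "segment", "segments", "dir", "linkdb"}


def migrate_request(old_request, crawl_id):
    """Rewrite an old-style request body into the crawlId-based layout."""
    old_args = old_request.get("args", {})
    new_args = {k: v for k, v in old_args.items() if k not in PATH_KEYS}
    return {
        "type": old_request["type"],
        "confId": old_request.get("confId", "default"),
        "crawlId": crawl_id,
        "args": new_args,
    }


old_parse = {
    "type": "PARSE",
    "confId": "default",
    "args": {"segment": "crawl/segments/20150331153517", "noFilter": "true"},
}
new_parse = migrate_request(old_parse, "crawl01")
print(new_parse)
```

Applied to the old PARSE request, this drops segment and keeps noFilter, matching the "+" lines for that job.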
  
@@ -170, +173 @@

  {  
      "type":"INVERTLINKS",
      "confId":"default",
-     "args": {"linkdb":"crawl/linkdb", "dir":"crawl/segments"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
  
- The args contain keys - crawldb, segments, dir, force, noNormalize, noFilter. Set them to appropriate values.
+ The args contain keys - force, noNormalize, noFilter. Set them to appropriate values.
- 
- To use multiple segments, the segments parameter should contain the names of the segments separated by spaces. To specify an entire directory, use the dir parameter.
  
  The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20invertlinks|here]].
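
The JSON responses quoted in this diff share the fields confId, args, crawlId, msg, id and state. The sketch below, in Python, pulls out the fields a client would most likely log. The sample string reuses only the fields visible in the GENERATE response above; the diff truncates each response, so a real one may carry more. The function name `summarize_job` is hypothetical.

```python
import json

# Fields copied from the GENERATE response shown in this diff; the hunk
# cuts the response off after "state", so this is not the full body.
SAMPLE_RESPONSE = """
{
    "confId": "default",
    "args": {},
    "crawlId": "crawl01",
    "msg": "OK",
    "id": "default-GENERATE-274614034",
    "state": "RUNNING"
}
"""


def summarize_job(response_text):
    """Parse a job response and keep the identifying fields."""
    job = json.loads(response_text)
    return {key: job.get(key) for key in ("id", "crawlId", "state", "msg")}


summary = summarize_job(SAMPLE_RESPONSE)
print(summary["id"], summary["state"])
```

Only the states RUNNING and IDLE appear in this message, so the sketch reports the state as-is instead of interpreting it.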
  
@@ -184, +186 @@

  {{{{
  {
      "confId":"default",
-     "args":{"linkdb":"crawl/linkdb", "dir":"crawl/segments"},
-     "crawlId":null,
+     "args":{},
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-INVERTLINKS-572647647",
      "state":"RUNNING",
@@ -202, +204 @@

  {  
      "type":"DEDUP",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- 
- The args contain keys - crawldb. These should be put with appropriate values.
  
  The response of the request is a JSON output
  {{{{
  {
      "confId":"default",
      "args":{"crawldb":"crawl/crawldb"},
-     "crawlId":null,
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-DEDUP-1394212503",
      "state":"RUNNING",
@@ -222, +223 @@

  }
  }}}}
  
- === Readdb Job ===
- To run the readdb job call '''POST /db/readdb''' with the following
- {{{{
- POST /db/readdb
- {     
-     "type":"stats",
-     "confId":"default",
-     "args":{"crawldb":"crawl/crawldb"}
- }
- }}}}
- The other types are dump, topN and url. Their corresponding arguments can be found [[https://wiki.apache.org/nutch/bin/nutch%20readdb|here]].
- 
- The response of the request is a JSON output
- {{{{
-   {
-       "retry 0":"8350",
-       "minScore":"0.0",
-       "retry 1":"96",
-       "status":{ 
-                 "3":{"count":"21","statusValue":"db_gone"},
-                 "2":{"count":"594","statusValue":"db_fetched"},
-                 "1":{"count":"7721","statusValue":"db_unfetched"},
-                 "5":{"count":"86","statusValue":"db_redir_perm"},
-                 "4":{"count":"24","statusValue":"db_redir_temp"}
-                 },
-       "totalUrls":"8446",
-       "maxScore":"0.528",
-       "avgScore":"0.029593771"
-   }
- }}}}
- '''Note: ''' If any type other than stats (dump, topN or url) is used, the response will be a file (application/octet-stream).
- 
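
To see the revised requests side by side, the following Python sketch builds one request body per job type, in the order the jobs appear in this diff, all sharing one crawlId. The names `JOB_SEQUENCE` and `crawl_round_requests` are hypothetical. The message does not show the URL these bodies are posted to or whether each job must finish before the next is submitted, so the sketch stops at serialization.

```python
import json

# Job types and args as they appear on the "+" lines of this diff.
JOB_SEQUENCE = [
    ("INJECT", {"url_dir": "url/"}),
    ("GENERATE", {}),
    ("FETCH", {}),
    ("PARSE", {"noFilter": "true"}),
    ("UPDATEDB", {}),
    ("INVERTLINKS", {}),
    ("DEDUP", {}),
]


def crawl_round_requests(crawl_id, conf_id="default"):
    """Return the request bodies for one pass over JOB_SEQUENCE."""
    return [
        {"type": job_type, "confId": conf_id, "crawlId": crawl_id, "args": dict(args)}
        for job_type, args in JOB_SEQUENCE
    ]


requests_for_round = crawl_round_requests("crawl01")
for body in requests_for_round:
    print(json.dumps(body))
```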
