nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "Nutch_1.X_RESTAPI/RunningJobsTutorial" by SujenShah
Date Wed, 01 Apr 2015 03:54:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Nutch_1.X_RESTAPI/RunningJobsTutorial" page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI/RunningJobsTutorial

New page:
= How to run Jobs using the Nutch REST service =

<<TableOfContents(5)>>
== Introduction ==
This tutorial shows how to make REST calls to the NutchServer to run jobs such as
Inject, Generate and Fetch.

== Instructions to start Nutch Server ==
Follow the steps below to start an instance of the Nutch Server on localhost. 

1. :~$ cd runtime/local 

2. :~$ bin/nutch startserver -port <port_number> -host <host_name> [If the host/port
option is not specified then by default the server starts on localhost:8081]
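
Before issuing REST calls it can help to confirm that something is listening on the chosen host and port. The following is a minimal Python sketch, not part of Nutch: the helper name `wait_for_port` is hypothetical, and it only checks that a TCP connection can be opened, not that the listener is really the Nutch Server. The defaults localhost and 8081 are the ones mentioned in step 2.

```python
import socket
import time


def wait_for_port(host="localhost", port=8081, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: it verifies only that a socket accepts connections.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` right after `bin/nutch startserver` would block until the default port accepts connections or 30 seconds pass.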

== Jobs ==
The service currently supports the following jobs: Inject, Generate, Fetch,
Parse, Updatedb, Invertlinks, Dedup and Readdb.
To create a new job, issue a POST request to /job/create with the following JSON data:

{{{{
POST /job/create
   {
      "type":"job type",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }
}}}}
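
The request above could be issued from Python roughly as follows. This is a sketch under stated assumptions rather than an official client: the function names are made up for illustration, the base URL is the default localhost:8081 from the start-up instructions, and it assumes the server accepts and returns `application/json`.

```python
import json
import urllib.request


def build_job_payload(job_type, args, conf_id="default"):
    """Return the JSON body for POST /job/create as a plain dict."""
    return {"type": job_type, "confId": conf_id, "args": dict(args)}


def create_job(payload, base_url="http://localhost:8081"):
    """POST the payload to /job/create and return the decoded JSON reply.

    Assumption: the server speaks application/json in both directions.
    """
    request = urllib.request.Request(
        base_url + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `create_job(build_job_payload("job type", {"someParam": "someValue"}))` would send the generic body shown above; real job types and args replace the placeholders.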
=== Inject Job ===
To run the inject job, call POST /job/create with the following JSON data:
{{{{
POST /job/create
{   
    "type":"INJECT",
    "confId":"default",
    "args": {"crawldb":"crawl/crawldb", "url_dir":"url/"}
}
}}}}
The args contain two keys, crawldb and url_dir. Set each to the appropriate value.
The response to the request is a JSON object:
{{{{
{
   "confId":"default",
   "args":{"crawldb":"crawl/crawldb","url_dir":"url/"},
   "crawlId":null,
   "msg":"OK",
   "id":"default-INJECT-635077497",
   "state":"RUNNING",
   "type":"INJECT",
   "result":null
}
}}}}
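
A client will usually want the job id and state out of this reply. The sketch below parses the sample response shown above; the function name `job_summary` is hypothetical, and the only field names used are the ones that appear in that sample.

```python
import json

# The sample reply from the inject job, copied from the tutorial text.
SAMPLE_RESPONSE = """
{
   "confId":"default",
   "args":{"crawldb":"crawl/crawldb","url_dir":"url/"},
   "crawlId":null,
   "msg":"OK",
   "id":"default-INJECT-635077497",
   "state":"RUNNING",
   "type":"INJECT",
   "result":null
}
"""


def job_summary(response_text):
    """Return (id, state, msg) from a /job/create reply."""
    reply = json.loads(response_text)
    return reply["id"], reply["state"], reply["msg"]


print(job_summary(SAMPLE_RESPONSE))
```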

=== Generate Job ===
To run the generate job, call POST /job/create with the following JSON data:
{{{{
POST /job/create
{  
    "type":"GENERATE",
    "confId":"default",
    "args": {"crawldb":"crawl/crawldb", "segments_dir":"crawl/segments"}
}
}}}}
The args can contain the keys crawldb, segments_dir, force, topN, numFetchers, adddays, noFilter,
noNorm and maxNumSegments. Set each key you use to the appropriate value.

The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20generate|here]].

The response to the request is a JSON object:
{{{{
{
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb","segments_dir":"crawl/segments"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-GENERATE-274614034",
    "state":"RUNNING",
    "type":"GENERATE",
    "result":null
}
}}}}
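
Because the generate job takes many optional keys, a misspelled key is easy to miss. The following sketch checks args against the key list given above before building the request body. The helper name is hypothetical, and whether the server itself rejects unknown keys is not stated in this tutorial, so the check is purely client-side.

```python
# Key names copied from the list of generate args in this tutorial.
GENERATE_KEYS = {
    "crawldb", "segments_dir", "force", "topN", "numFetchers",
    "adddays", "noFilter", "noNorm", "maxNumSegments",
}


def generate_payload(args, conf_id="default"):
    """Build a GENERATE request body, rejecting keys not listed above."""
    unknown = sorted(set(args) - GENERATE_KEYS)
    if unknown:
        raise ValueError("unknown generate args: " + ", ".join(unknown))
    return {"type": "GENERATE", "confId": conf_id, "args": dict(args)}
```

For example, `generate_payload({"crawldb": "crawl/crawldb", "segments_dir": "crawl/segments"})` yields the request body shown earlier, while a key such as `topn` (wrong case) raises `ValueError`.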

=== Fetch Job ===
To run the fetch job, call POST /job/create with the following JSON data:
{{{{
POST /job/create
{  
    "type":"FETCH",
    "confId":"default",
    "args": {"segment":"crawl/segments/20150331153517"}
}
}}}}
The args can contain the keys segment, threads and noParsing. Set each key you use to the
appropriate value.

The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20fetch|here]].

The response to the request is a JSON object:
{{{{
{
     "confId":"default",
     "args":{"segment":"crawl/segments/20150331153517"},
     "crawlId":null,
     "msg":"idle",
     "id":"default-FETCH-99398319",
     "state":"IDLE",
     "type":"FETCH",
     "result":null
}
}}}}
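
The segment name in the example (20150331153517) looks like a yyyyMMddHHmmss timestamp, so the newest segment can be found by taking the largest name. The sketch below relies on that assumption; the helper names are hypothetical, and listing the segments directory itself is left to the caller.

```python
import posixpath
import re

# Assumption: segment directories are named with a 14-digit timestamp.
SEGMENT_NAME = re.compile(r"\d{14}")


def latest_segment(segment_names, segments_dir="crawl/segments"):
    """Return the path of the newest timestamp-named segment."""
    candidates = [n for n in segment_names if SEGMENT_NAME.fullmatch(n)]
    if not candidates:
        raise ValueError("no timestamp-named segments found")
    # Fixed-width digit strings sort chronologically as plain strings.
    return posixpath.join(segments_dir, max(candidates))


def fetch_payload(segment, conf_id="default"):
    """Build a FETCH request body for one segment."""
    return {"type": "FETCH", "confId": conf_id, "args": {"segment": segment}}
```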

=== Parse Job ===
To run the parse job, call POST /job/create with the following JSON data:
{{{{
POST /job/create
{  
    "type":"PARSE",
    "confId":"default",
    "args": {"segment":"crawl/segments/20150331153517", "noFilter":"true"}
}
}}}}
The args can contain the keys segment, noFilter and noNormalize. Set each key you use to the
appropriate value.

The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20parse|here]].

The response to the request is a JSON object:
{{{{
{
     "confId":"default",
     "args":{"segment":"crawl/segments/20150331153517","noFilter":"true"},
     "crawlId":null,
     "msg":"OK",
     "id":"default-PARSE-1413156163",
     "state":"IDLE",
     "type":"PARSE",
     "result":null
}
}}}}

=== Updatedb Job ===
To run the updatedb job, call POST /job/create with the following JSON data:
{{{{
POST /job/create
{  
    "type":"UPDATEDB",
    "confId":"default",
    "args": {"crawldb":"crawl/crawldb", "segments":"crawl/segments/20150331153517"}
}
}}}}
The args can contain the keys crawldb, segments, dir, force, normalize, filter and noAdditions.
Set each key you use to the appropriate value.

To use multiple segments, set the segments parameter to the segment names
separated by spaces. To specify an entire directory of segments, use the dir parameter instead.

The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20updatedb|here]].
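
As an illustration of the space-separated segments value, the sketch below joins a list of segment paths into one string. The helper name is hypothetical, and the second segment path in the usage note is invented for the example.

```python
def updatedb_payload(crawldb, segments, conf_id="default"):
    """Build an UPDATEDB request body from a list of segment paths."""
    if not segments:
        raise ValueError("at least one segment is required")
    return {
        "type": "UPDATEDB",
        "confId": conf_id,
        # Multiple segments are passed as one space-separated string.
        "args": {"crawldb": crawldb, "segments": " ".join(segments)},
    }
```

Calling `updatedb_payload("crawl/crawldb", ["crawl/segments/20150331153517", "crawl/segments/20150401091500"])` produces a segments value of `"crawl/segments/20150331153517 crawl/segments/20150401091500"`.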

The response to the request is a JSON object:
{{{{
{
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb","segments":"crawl/segments/20150331153517"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-UPDATEDB-1250603698",
    "state":"RUNNING",
    "type":"UPDATEDB",
    "result":null
}
}}}}

=== Invertlinks Job ===
To run the invertlinks job, call POST /job/create with the following JSON data:
{{{{
POST /job/create
{  
    "type":"INVERTLINKS",
    "confId":"default",
    "args": {"linkdb":"crawl/linkdb", "dir":"crawl/segments"}
}
}}}}

The args can contain the keys linkdb, segments, dir, force, noNormalize and noFilter. Set each
key you use to the appropriate value.

To use multiple segments, set the segments parameter to the segment names
separated by spaces. To specify an entire directory of segments, use the dir parameter instead.

The description of these parameters can be found [[https://wiki.apache.org/nutch/bin/nutch%20invertlinks|here]].

The response to the request is a JSON object:
{{{{
{
    "confId":"default",
    "args":{"linkdb":"crawl/linkdb", "dir":"crawl/segments"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-INVERTLINKS-572647647",
    "state":"RUNNING",
    "type":"INVERTLINKS",
    "result":null
}
}}}}


=== Dedup Job ===
To run the dedup job, call POST /job/create with the following JSON data:
{{{{
POST /job/create
{  
    "type":"DEDUP",
    "confId":"default",
    "args": {"crawldb":"crawl/crawldb"}
}
}}}}

The args contain a single key, crawldb. Set it to the appropriate value.

The response to the request is a JSON object:
{{{{
{
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-DEDUP-1394212503",
    "state":"RUNNING",
    "type":"DEDUP",
    "result":null
}
}}}}
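
The request bodies from the preceding sections can be assembled in sequence once a segment has been generated. The sketch below only builds the bodies; it does not submit them. The ordering (fetch, parse, updatedb, invertlinks, dedup) follows the usual Nutch crawl cycle and is an assumption here, as is the helper name. Since the sample responses report a state such as RUNNING, a real client would also need to wait for each job to finish before submitting the next, which this tutorial does not cover.

```python
def post_generate_payloads(segment, crawldb="crawl/crawldb",
                           linkdb="crawl/linkdb",
                           segments_dir="crawl/segments",
                           conf_id="default"):
    """Return the request bodies for the steps that follow a generate job."""
    steps = [
        ("FETCH", {"segment": segment}),
        ("PARSE", {"segment": segment}),
        ("UPDATEDB", {"crawldb": crawldb, "segments": segment}),
        ("INVERTLINKS", {"linkdb": linkdb, "dir": segments_dir}),
        ("DEDUP", {"crawldb": crawldb}),
    ]
    return [
        {"type": job_type, "confId": conf_id, "args": args}
        for job_type, args in steps
    ]
```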

=== Readdb Job ===
To run the readdb job, call '''POST /db/readdb''' with the following JSON data:
{{{{
POST /db/readdb
{     
    "type":"stats",
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb"}
}
}}}}
The example above uses the stats type. The other types are dump, topN and url. Their corresponding arguments can be found [[https://wiki.apache.org/nutch/bin/nutch%20readdb|here]].

The response to the stats request is a JSON object:
{{{{
  {
      "retry 0":"8350",
      "minScore":"0.0",
      "retry 1":"96",
      "status":{ 
                "3":{"count":"21","statusValue":"db_gone"},
                "2":{"count":"594","statusValue":"db_fetched"},
                "1":{"count":"7721","statusValue":"db_unfetched"},
                "5":{"count":"86","statusValue":"db_redir_perm"},
                "4":{"count":"24","statusValue":"db_redir_temp"}
                },
      "totalUrls":"8446",
      "maxScore":"0.528",
      "avgScore":"0.029593771"
  }
}}}}
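
Note that every number in the stats reply arrives as a string. The sketch below converts the per-status counts from the sample reply above to integers and checks that they add up to totalUrls. The function name `status_counts` is hypothetical.

```python
import json

# The sample stats reply, copied from the tutorial text.
SAMPLE_STATS = """
{
    "retry 0":"8350",
    "minScore":"0.0",
    "retry 1":"96",
    "status":{
        "3":{"count":"21","statusValue":"db_gone"},
        "2":{"count":"594","statusValue":"db_fetched"},
        "1":{"count":"7721","statusValue":"db_unfetched"},
        "5":{"count":"86","statusValue":"db_redir_perm"},
        "4":{"count":"24","statusValue":"db_redir_temp"}
    },
    "totalUrls":"8446",
    "maxScore":"0.528",
    "avgScore":"0.029593771"
}
"""


def status_counts(stats):
    """Map each status name to its count as an integer."""
    return {
        entry["statusValue"]: int(entry["count"])
        for entry in stats["status"].values()
    }


stats = json.loads(SAMPLE_STATS)
counts = status_counts(stats)
# In this sample the per-status counts sum to totalUrls.
print(counts, sum(counts.values()) == int(stats["totalUrls"]))
```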
