nutch-dev mailing list archives

From Christophe Noel <christophe.n...@cetic.be>
Subject Plugins - sum up
Date Thu, 03 Mar 2005 13:56:47 GMT
Hello,

Nutch plugins provide features such as parsing PDF files or indexing 
"date modified" tags.

Please confirm this short summary and tutorial on plugins.

(1)
PARSING plugins : parse content of different MIME types -> html, 
text, pdf, msword, mp3, rtf
** parse-ext ** is a wrapper ... what can it do ?

INDEXING plugins : index different fields of the fetched pages
** index-basic : basic indexing
** index-more : indexes the "last modified" and "content-type" tags; 
"file-length" is coming soon...

QUERY plugins : handle different kinds of queries (query-basic handles basic 
queries, of course)
** query-site : query handler for site-restricted searches such as "nutch site:www.nutch.org" 
(missing : a site-only search such as "site:www.nutch.org")
** query-url : query handler for URL searches.

PROTOCOL plugins : handle different protocols such as file, http, and ftp

Unknown (or poorly known) to me :
ONTOLOGY
CLUSTERING CARROT2
LANGUAGE-IDENTIFIER
(please explain).

(2) USE PLUGINS
Use tags like the following in nutch-site.xml or nutch-default.xml :
<nutch-conf>
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|ftp)|parse-(text|html|pdf|rtf|msword|ext)|index-basic|query-(basic|site|url)|language-identifier</value>
</property>
</nutch-conf>
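
To check which plugins a plugin.includes value like the one above would select, here is a minimal Java sketch. It assumes the value is treated as a regular expression that must match a plugin's whole id; the class name PluginIncludes and the helper isIncluded are hypothetical names used only for this illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class PluginIncludes {
    // The plugin.includes value from the message above, copied verbatim.
    static final String INCLUDES =
        "protocol-(http|ftp)|parse-(text|html|pdf|rtf|msword|ext)"
        + "|index-basic|query-(basic|site|url)|language-identifier";

    // Assumption: a plugin is enabled only when its whole id matches
    // the expression (full match, not a substring search).
    static boolean isIncluded(String pluginId) {
        return Pattern.matches(INCLUDES, pluginId);
    }

    public static void main(String[] args) {
        String[] ids = {
            "protocol-http", "protocol-file", "parse-pdf",
            "index-basic", "index-more", "query-site"
        };
        // Print each plugin id together with whether the expression selects it.
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that assumption, index-more and protocol-file would not be selected by this value, so they would probably have to be added to the expression (for example index-(basic|more)) before they are loaded.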

(3) HOW TO MAKE A PLUGIN ?
What are the main difficulties in writing a plugin ?

Thanks for your help. It would be great to discuss this on the Wiki.
