nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "IndexStructure" by LewisJohnMcgibbney
Date Sat, 16 Feb 2013 04:45:51 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "IndexStructure" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=17&rev2=18

  
  The index structure formed after indexing is shown below:
  
- ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin''' ||'''Comment'''||
+ ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''||
- || boost || YES || Not Indexed || scoring-opic/link || Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. ||
+ || boost || YES || Not Indexed || various scoring plugins || Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. ||
- || digest || YES || Not Indexed || /!\ NEEDS COMMENT /!\ || Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. ||
+ || digest || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. ||
  || lang || YES || Un-Tokenized || language-identifier || Add a '''lang''', language field to a document. ||
- || segment || YES || Not Indexed || /!\ NEEDS COMMENT /!\ || Adds the originating '''segment''' field to the document, used to identify the most recent segment in which this document was fetched. ||
+ || segment || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds the originating '''segment''' field to the document, used to identify the most recent segment in which this document was fetched. ||
  || tstamp || YES || Tokenized || /!\ NEEDS COMMENT /!\ || Adds a '''timestamp''' field of the most recent time this document was fetched ||
  || cc:license || YES || Indexed, Tokenized || creativecommons || Adds the entire license as '''cc:license=xxx''' and '''attributes''' extracted of the license url ||
  || cc:meta || YES || Indexed, Tokenized || creativecommons || Adds the license location as '''cc:meta=xxx''' ||
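
The digest row above says the value can be an MD5 over the page content. As a rough illustration of what such a value looks like, here is a minimal sketch using only the JDK's java.security.MessageDigest. The class name DigestSketch and the helper md5Hex are hypothetical and are not Nutch code; how Nutch itself computes and stores the digest is not shown in this message.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only: not part of Nutch.
class DigestSketch {

    // Returns the MD5 of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content);
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample content; a real crawler would hash the fetched bytes.
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println("digest=" + md5Hex(page));
    }
}
```

Two documents with byte-identical content produce the same string, which is why a stored digest like this is handy for spotting duplicates.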
