nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "IndexWriters" by RoannelFernandez
Date Fri, 22 Jun 2018 17:14:05 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "IndexWriters" page has been changed by RoannelFernandez:
https://wiki.apache.org/nutch/IndexWriters?action=diff&rev1=12&rev2=13

Comment:
Resolving some feedback

     * `<destination>` indicates the new name of the field. For example, if the configuration
is `<field source="metatag.description" dest="description"/>`, the field '''metatag.description'''
is renamed to '''description'''.
   * `<remove>` indicates which fields of the document should be removed. Each child
element of `<remove>` has the form: `<field source="<source>"/>`
     * `<source>` indicates the name of the field to be removed.
+ 
+ {{{{#!wiki caution
+ '''Mapping section can't be empty'''
+ 
+ If you don't want to modify the document, just leave `<copy>`, `<rename>` and `<remove>` empty, like:
`<mapping> <copy /> <rename /> <remove /> </mapping>`
+ 
+ }}}}
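
The rule in the caution above can be checked mechanically before a configuration is deployed. The following is a minimal sketch, not part of Nutch: the function name `missing_mapping_sections`, the writer class `org.example.HypotheticalIndexWriter` and both sample snippets are assumptions made for illustration. It only reports which of the three child elements are absent from `<mapping>`; how Nutch itself reacts to an empty mapping is not stated here.

```python
import xml.etree.ElementTree as ET

# The three children that the caution says must be present, even if empty.
REQUIRED_SECTIONS = ("copy", "rename", "remove")


def missing_mapping_sections(writer_xml):
    """Return the names of required <mapping> children absent from a <writer>.

    Hypothetical helper for illustration; it is not Nutch code.
    """
    writer = ET.fromstring(writer_xml)
    mapping = writer.find("mapping")
    if mapping is None:
        # No <mapping> element at all: report that instead of its children.
        return ["mapping"]
    return [name for name in REQUIRED_SECTIONS if mapping.find(name) is None]


# A "do nothing" mapping, written the way the caution recommends.
NO_OP_WRITER = """
<writer id="example_1" class="org.example.HypotheticalIndexWriter">
  <mapping> <copy /> <rename /> <remove /> </mapping>
</writer>
"""

# An empty mapping, which the caution says is not allowed.
EMPTY_MAPPING_WRITER = """
<writer id="example_2" class="org.example.HypotheticalIndexWriter">
  <mapping />
</writer>
"""

print(missing_mapping_sections(NO_OP_WRITER))
print(missing_mapping_sections(EMPTY_MAPPING_WRITER))
```

The first call reports nothing missing, while the second lists all three sections.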
+ 
+ === Use case ===
+ 
+ We have two servers previously configured (Solr and RabbitMQ). We want to send documents
to each one, but with a different structure. Prior to the index step, each document has this
hypothetical structure:
+ 
+ {{{#!highlight properties
+ host: "www.example.org"
+ domain: "example.org"
+ title: "Example page"
+ metatag.description: "Example page description"
+ metatag.keywords: ["example", "page"]
+ segment: 20180621163128
+ }}}
+ With this configuration we modify the structure of each document in different ways, depending
on the index writer:
+ 
+ {{{#!highlight xml
+ <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
+   <parameters>
+     <!-- Parameters here -->
+   </parameters>
+   <mapping>
+     <copy/>
+     <rename>
+       <field source="metatag.description" dest="description"/>
+       <field source="metatag.keywords" dest="keywords"/>
+     </rename>
+     <remove>
+       <field source="segment"/>
+     </remove>
+   </mapping>
+ </writer>
+ <writer id="indexer_rabbit_1" class="org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter">
+   <parameters>
+     <!-- Parameters here -->
+   </parameters>
+   <mapping>
+     <copy>
+       <field source="title" dest="search"/>
+     </copy>
+     <rename>
+       <field source="metatag.description" dest="description"/>
+       <field source="metatag.keywords" dest="keywords"/>
+       <field source="domain" dest="domain_name"/>
+     </rename>
+     <remove />
+   </mapping>
+ </writer>
+ }}}
+ 
+ For `indexer-solr` we'll get documents like:
+ 
+ {{{#!highlight properties
+ host: "www.example.org"
+ domain: "example.org"
+ title: "Example page"
+ description: "Example page description"
+ keywords: ["example", "page"]
+ }}}
+ 
+ For `indexer-rabbit` the document's structure is like:
+ 
+ {{{#!highlight properties
+ host: "www.example.org"
+ domain_name: "example.org"
+ title: "Example page"
+ search: "Example page"
+ description: "Example page description"
+ keywords: ["example", "page"]
+ segment: 20180621163128
+ }}}
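
The effect of the two mappings in the use case above can be reproduced with a few lines of Python. This is a minimal sketch of the copy, rename and remove semantics as described on this page, not Nutch's implementation: the function `apply_mapping` is hypothetical, documents are modelled as plain dictionaries, and applying the actions in the order copy, rename, remove is an assumption.

```python
def apply_mapping(doc, copy=(), rename=(), remove=()):
    """Return a new document with copy, rename and remove applied in that order.

    Hypothetical helper for illustration; the order of the three steps is an
    assumption, and fields missing from the document are silently skipped.
    """
    out = dict(doc)  # leave the caller's document untouched
    for source, dest in copy:
        if source in out:
            out[dest] = out[source]
    for source, dest in rename:
        if source in out:
            out[dest] = out.pop(source)
    for source in remove:
        out.pop(source, None)
    return out


# The hypothetical document from the use case, before the index step.
doc = {
    "host": "www.example.org",
    "domain": "example.org",
    "title": "Example page",
    "metatag.description": "Example page description",
    "metatag.keywords": ["example", "page"],
    "segment": 20180621163128,
}

# Mapping of the writer with id "indexer_solr_1".
solr_doc = apply_mapping(
    doc,
    rename=[("metatag.description", "description"),
            ("metatag.keywords", "keywords")],
    remove=["segment"],
)

# Mapping of the writer with id "indexer_rabbit_1".
rabbit_doc = apply_mapping(
    doc,
    copy=[("title", "search")],
    rename=[("metatag.description", "description"),
            ("metatag.keywords", "keywords"),
            ("domain", "domain_name")],
)

print(sorted(solr_doc))
print(sorted(rabbit_doc))
```

Each writer receives its own mapped copy, so the rename applied for one writer does not affect the document sent to the other.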
  
  == Parameters section ==
  
