nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch
Date Mon, 06 Aug 2012 10:12:02 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429058#comment-13429058
] 

Julien Nioche commented on NUTCH-1445:
--------------------------------------

Ferdy - just to reiterate what was said on a previous issue: please give people time to review
your contribs before committing your own stuff. I am sure your code is fine and it does not
really affect existing code too much but I think it is a good practice that we should try
and stick to.

Instead of having multiple commands for the indexing backends, can't we have a single job and
define which backends to use (SOLR, ES) via configuration? There is an open issue on 'pluggable
indexing backends' [https://issues.apache.org/jira/browse/NUTCH-1047]. Can we discuss this
there?
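
To make the single-job idea concrete, here is a minimal sketch of how backends could be chosen
from configuration. It is only an illustration under stated assumptions: the IndexBackend
interface, the RecordingBackend stand-in, the registry and the "indexer.backends" property are
all hypothetical names, not existing Nutch code and not the design proposed in NUTCH-1047.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names exist in Nutch.
public class PluggableIndexerSketch {

    /** Assumed contract that every indexing backend would implement. */
    interface IndexBackend {
        String name();
        void write(Map<String, String> doc);
    }

    /** Stand-in backend that only records documents instead of talking to a server. */
    static class RecordingBackend implements IndexBackend {
        private final String name;
        final List<Map<String, String>> written = new ArrayList<>();

        RecordingBackend(String name) { this.name = name; }

        @Override public String name() { return name; }
        @Override public void write(Map<String, String> doc) { written.add(doc); }
    }

    /** Maps a configuration key to a factory for the matching backend. */
    static final Map<String, Supplier<IndexBackend>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("solr", () -> new RecordingBackend("solr"));
        REGISTRY.put("elastic", () -> new RecordingBackend("elastic"));
    }

    /** Reads the (assumed) comma-separated "indexer.backends" property and builds the backends. */
    static List<IndexBackend> fromConfig(Properties conf) {
        List<IndexBackend> backends = new ArrayList<>();
        for (String key : conf.getProperty("indexer.backends", "").split(",")) {
            String trimmed = key.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate an empty value or stray commas
            }
            Supplier<IndexBackend> factory = REGISTRY.get(trimmed);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown indexing backend: " + trimmed);
            }
            backends.add(factory.get());
        }
        return backends;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("indexer.backends", "solr, elastic");

        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.org/");

        // One job, one loop: every configured backend receives the same document.
        for (IndexBackend backend : fromConfig(conf)) {
            backend.write(doc);
            System.out.println("indexed 1 doc via " + backend.name());
        }
    }
}
```

With something along these lines, switching from SOLR to ES (or indexing to both) would be a
configuration change rather than a separate command.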


                
> Add ElasticIndexerJob that indexes to elasticsearch
> ---------------------------------------------------
>
>                 Key: NUTCH-1445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1445
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1445-addPropsToConfig.patch, NUTCH-1445-addToNutchScript.patch,
NUTCH-1445.patch
>
>
> We have created a new indexer job ElasticIndexerJob that indexes to elasticsearch. It
was originally based upon https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2
license), but we have modified it greatly to make it integrate as well as possible into Nutch.
The greatest modification is that documents are asynchronously flushed in bulk to elasticsearch.
> Elasticsearch rocks. Both performance and ease of configuration are awesome. You simply
deploy a server by unpacking the tar, configure the clustername, start the server and fire
away indexing requests. Indices are automatically created. Fields are automapped. (Of course
it is recommended to create your own optimized mapping, but that is beyond the scope of this issue.)
Multiple servers connect without extra configuration, simply by using the same clustername
(by means of multicast). There are tons of advanced options, such as sharding, replication,
disk striping etc.
> To give an example of the performance: With 20+ nodes we are able to index over 1M docs
(average sized webdocuments) per minute. The best part is that the added documents are almost
instantly searchable, so there are none of the hidden commit costs that Solr has. This is with
out-of-the-box configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)
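
The asynchronous bulk flushing mentioned in the description can be illustrated with a small
sketch. This is not the code from the attached patch and it uses no Elasticsearch API: the
class name, the bulkSize parameter and the Consumer-based sink are assumptions made purely for
illustration. Documents are buffered, and each full buffer is handed to a background thread so
that the caller does not wait for the send to finish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of buffered, asynchronous bulk flushing; not the NUTCH-1445 patch.
public class AsyncBulkFlushSketch implements AutoCloseable {

    private final int bulkSize;
    private final Consumer<List<String>> sink; // stands in for "send one bulk request"
    private final ExecutorService sender = Executors.newSingleThreadExecutor();
    private List<String> buffer = new ArrayList<>();

    AsyncBulkFlushSketch(int bulkSize, Consumer<List<String>> sink) {
        this.bulkSize = bulkSize;
        this.sink = sink;
    }

    /** Buffers a document and triggers a flush once the buffer is full. */
    void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= bulkSize) {
            flush();
        }
    }

    /** Hands the current buffer to the background thread and starts a fresh one. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        final List<String> batch = buffer;
        buffer = new ArrayList<>();
        sender.submit(() -> sink.accept(batch));
    }

    /** Flushes the remainder and waits for pending batches to be sent. */
    @Override
    public void close() {
        flush();
        sender.shutdown();
        try {
            sender.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new CopyOnWriteArrayList<>();
        try (AsyncBulkFlushSketch writer =
                 new AsyncBulkFlushSketch(3, batch -> batchSizes.add(batch.size()))) {
            for (int i = 0; i < 7; i++) {
                writer.add("doc-" + i);
            }
        }
        System.out.println(batchSizes); // prints [3, 3, 1]
    }
}
```

A single sender thread keeps batches in order; a real indexer would also need error handling
and some bound on the number of batches in flight.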

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
