nutch-dev mailing list archives

From "Matt MacDonald (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch
Date Fri, 31 Aug 2012 11:15:07 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445841#comment-13445841
] 

Matt MacDonald commented on NUTCH-1445:
---------------------------------------

Hi,

I'm trying to use the new ElasticSearch indexer and have run into an issue that I hope
you can help with. The feature is new to Nutch and little has been written about how
to use it, so I hope it's OK to post the error here. If I should open
a new JIRA ticket rather than commenting on this one, please let me know. Any ideas about
how to invoke and/or configure my Nutch 2.x and ElasticSearch 0.19.4 setup so that I can use
ElasticSearch as the search index?

I'm running the elasticindex command with the following:

{noformat}bin/nutch elasticindex "Doppleganger" -reindex{noformat}

*and I see this output:*
{noformat}
[ matt@Office-iMac ~/Projects/nutch-trunk/runtime/local (git::2.x) ] bin/nutch elasticindex "Doppleganger" -reindex
2012-08-31 06:44:09.238 java[53609:1903] Unable to load realm info from SCDynamicStore
Exception in thread "main" java.lang.RuntimeException: job failed: name=elastic-index [Doppleganger], jobid=job_local_0001
	at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
	at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:52)
	at org.apache.nutch.indexer.elastic.ElasticIndexerJob.indexElastic(ElasticIndexerJob.java:60)
	at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:73)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.indexer.elastic.ElasticIndexerJob.main(ElasticIndexerJob.java:78)
{noformat}

*Checking logs/hadoop.log shows:*
{noformat}
2012-08-31 06:44:41,581 WARN  elasticsearch.discovery - [Mother Night] waited for 30s and no initial state was set by the discovery
2012-08-31 06:44:41,581 INFO  elasticsearch.discovery - [Mother Night] Doppleganger/2IUXHWKhQfGsBhmPiozyqg
2012-08-31 06:44:41,584 INFO  elasticsearch.http - [Mother Night] bound_address {inet[/0.0.0.0:9202]}, publish_address {inet[/192.168.1.133:9202]}
2012-08-31 06:44:41,585 INFO  elasticsearch.node - [Mother Night] {0.19.4}[53609]: started
2012-08-31 06:44:41,587 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2012-08-31 06:44:41,587 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2012-08-31 06:44:41,587 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2012-08-31 06:44:41,587 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-08-31 06:44:42,174 INFO  elastic.ElasticWriter - Processing bulk request [docs = 500, length = 732991, total docs = 500, last doc in bulk = 'us.ma.watertown.ci.www:http/Archive.aspx?ADID=357']
2012-08-31 06:44:42,492 INFO  elastic.ElasticWriter - Processing bulk request [docs = 500, length = 943572, total docs = 1000, last doc in bulk = 'us.ma.watertown.ci.www:http/Directory.aspx?DID=92']
2012-08-31 06:44:42,493 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2012-08-31 06:44:42,494 WARN  mapred.LocalJobRunner - job_local_0001
org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: type is missing;2:
type is missing;3: type is missing;4: type is missing;5: type is missing;6: type is missing;7:
type is missing;8: type is missing;9: type is missing;10: type is missing;11: type is missing;12:
type is missing;13: type is missing;14: type is missing;15: type is missing;16: type is missing;17:
type is missing;18: type is missing;19: type is missing;20: type is missing;21: type is missing;22:
type is missing;23: type is missing;24: type is missing;25: type is missing;26: type is missing;27:
type is missing;28: type is missing;29: type is missing;30: type is missing;31: type is missing;32:
type is missing;33: type is missing;34: type is missing;35: type is missing;36: type is missing;37:
type is missing;38: type is missing;39: type is missing;40: type is missing;41: type is missing;42:
type is missing;43: type is missing;44: type is missing;45: type is missing;46: type is missing;47:
type is missing;48: type is missing;49: type is missing;50: type is missing;51: type is missing;52:
type is missing;53: type is missing;54: type is missing;55: type is missing;56: type is missing;57:
type is missing;58: type is missing;59: type is missing;60: type is missing;61: type is missing;62:
type is missing;63: type is missing;64: type is missing;65: type is missing;66: type is missing;67:
type is missing;68: type is missing;69: type is missing;70: type is missing;71: type is missing;72:
type is missing;73: type is missing;74: type is missing;75: type is missing;76: type is missing;77:
type is missing;78: type is missing;79: type is missing;80: type is missing;81: type is missing;82:
type is missing;83: type is missing;84: type is missing;85: type is missing;86: type is missing;87:
type is missing;88: type is missing;89: type is missing;90: type is missing;91: type is missing;92:
type is missing;93: type is missing;94: type is missing;95: type is missing;96: type is missing;97:
type is missing;98: type is missing;99: type is missing;100: type is missing;101: type is
missing;102: type is missing;103: type is missing;104: type is missing;105: type is missing;106:
type is missing;107: type is missing;108: type is missing;109: type is missing;110: type is
missing;111: type is missing;112: type is missing;113: type is missing;114: type is missing;115:
type is missing;116: type is missing;117: type is missing;118: type is missing;119: type is
missing;120: type is missing;121: type is missing;122: type is missing;123: type is missing;124:
type is missing;125: type is missing;126: type is missing;127: type is missing;128: type is
missing;129: type is missing;130: type is missing;131: type is missing;132: type is missing;133:
type is missing;134: type is missing;135: type is missing;136: type is missing;137: type is
missing;138: type is missing;139: type is missing;140: type is missing;141: type is missing;142:
type is missing;143: type is missing;144: type is missing;145: type is missing;146: type is
missing;147: type is missing;148: type is missing;149: type is missing;150: type is missing;151:
type is missing;152: type is missing;153: type is missing;154: type is missing;155: type is
missing;156: type is missing;157: type is missing;158: type is missing;159: type is missing;160:
type is missing;161: type is missing;162: type is missing;163: type is missing;164: type is
missing;165: type is missing;166: type is missing;167: type is missing;168: type is missing;169:
type is missing;170: type is missing;171: type is missing;172: type is missing;173: type is
missing;174: type is missing;175: type is missing;176: type is missing;177: type is missing;178:
type is missing;179: type is missing;180: type is missing;181: type is missing;182: type is
missing;183: type is missing;184: type is missing;185: type is missing;186: type is missing;187:
type is missing;188: type is missing;189: type is missing;190: type is missing;191: type is
missing;192: type is missing;193: type is missing;194: type is missing;195: type is missing;196:
type is missing;197: type is missing;198: type is missing;199: type is missing;200: type is
missing;201: type is missing;202: type is missing;203: type is missing;204: type is missing;205:
type is missing;206: type is missing;207: type is missing;208: type is missing;209: type is
missing;210: type is missing;211: type is missing;212: type is missing;213: type is missing;214:
type is missing;215: type is missing;216: type is missing;217: type is missing;218: type is
missing;219: type is missing;220: type is missing;221: type is missing;222: type is missing;223:
type is missing;224: type is missing;225: type is missing;226: type is missing;227: type is
missing;228: type is missing;229: type is missing;230: type is missing;231: type is missing;232:
type is missing;233: type is missing;234: type is missing;235: type is missing;236: type is
missing;237: type is missing;238: type is missing;239: type is missing;240: type is missing;241:
type is missing;242: type is missing;243: type is missing;244: type is missing;245: type is
missing;246: type is missing;247: type is missing;248: type is missing;249: type is missing;250:
type is missing;251: type is missing;252: type is missing;253: type is missing;254: type is
missing;255: type is missing;256: type is missing;257: type is missing;258: type is missing;259:
type is missing;260: type is missing;261: type is missing;262: type is missing;263: type is
missing;264: type is missing;265: type is missing;266: type is missing;267: type is missing;268:
type is missing;269: type is missing;270: type is missing;271: type is missing;272: type is
missing;273: type is missing;274: type is missing;275: type is missing;276: type is missing;277:
type is missing;278: type is missing;279: type is missing;280: type is missing;281: type is
missing;282: type is missing;283: type is missing;284: type is missing;285: type is missing;286:
type is missing;287: type is missing;288: type is missing;289: type is missing;290: type is
missing;291: type is missing;292: type is missing;293: type is missing;294: type is missing;295:
type is missing;296: type is missing;297: type is missing;298: type is missing;299: type is
missing;300: type is missing;301: type is missing;302: type is missing;303: type is missing;304:
type is missing;305: type is missing;306: type is missing;307: type is missing;308: type is
missing;309: type is missing;310: type is missing;311: type is missing;312: type is missing;313:
type is missing;314: type is missing;315: type is missing;316: type is missing;317: type is
missing;318: type is missing;319: type is missing;320: type is missing;321: type is missing;322:
type is missing;323: type is missing;324: type is missing;325: type is missing;326: type is
missing;327: type is missing;328: type is missing;329: type is missing;330: type is missing;331:
type is missing;332: type is missing;333: type is missing;334: type is missing;335: type is
missing;336: type is missing;337: type is missing;338: type is missing;339: type is missing;340:
type is missing;341: type is missing;342: type is missing;343: type is missing;344: type is
missing;345: type is missing;346: type is missing;347: type is missing;348: type is missing;349:
type is missing;350: type is missing;351: type is missing;352: type is missing;353: type is
missing;354: type is missing;355: type is missing;356: type is missing;357: type is missing;358:
type is missing;359: type is missing;360: type is missing;361: type is missing;362: type is
missing;363: type is missing;364: type is missing;365: type is missing;366: type is missing;367:
type is missing;368: type is missing;369: type is missing;370: type is missing;371: type is
missing;372: type is missing;373: type is missing;374: type is missing;375: type is missing;376:
type is missing;377: type is missing;378: type is missing;379: type is missing;380: type is
missing;381: type is missing;382: type is missing;383: type is missing;384: type is missing;385:
type is missing;386: type is missing;387: type is missing;388: type is missing;389: type is
missing;390: type is missing;391: type is missing;392: type is missing;393: type is missing;394:
type is missing;395: type is missing;396: type is missing;397: type is missing;398: type is
missing;399: type is missing;400: type is missing;401: type is missing;402: type is missing;403:
type is missing;404: type is missing;405: type is missing;406: type is missing;407: type is
missing;408: type is missing;409: type is missing;410: type is missing;411: type is missing;412:
type is missing;413: type is missing;414: type is missing;415: type is missing;416: type is
missing;417: type is missing;418: type is missing;419: type is missing;420: type is missing;421:
type is missing;422: type is missing;423: type is missing;424: type is missing;425: type is
missing;426: type is missing;427: type is missing;428: type is missing;429: type is missing;430:
type is missing;431: type is missing;432: type is missing;433: type is missing;434: type is
missing;435: type is missing;436: type is missing;437: type is missing;438: type is missing;439:
type is missing;440: type is missing;441: type is missing;442: type is missing;443: type is
missing;444: type is missing;445: type is missing;446: type is missing;447: type is missing;448:
type is missing;449: type is missing;450: type is missing;451: type is missing;452: type is
missing;453: type is missing;454: type is missing;455: type is missing;456: type is missing;457:
type is missing;458: type is missing;459: type is missing;460: type is missing;461: type is
missing;462: type is missing;463: type is missing;464: type is missing;465: type is missing;466:
type is missing;467: type is missing;468: type is missing;469: type is missing;470: type is
missing;471: type is missing;472: type is missing;473: type is missing;474: type is missing;475:
type is missing;476: type is missing;477: type is missing;478: type is missing;479: type is
missing;480: type is missing;481: type is missing;482: type is missing;483: type is missing;484:
type is missing;485: type is missing;486: type is missing;487: type is missing;488: type is
missing;489: type is missing;490: type is missing;491: type is missing;492: type is missing;493:
type is missing;494: type is missing;495: type is missing;496: type is missing;497: type is
missing;498: type is missing;499: type is missing;500: type is missing;
	at org.elasticsearch.action.bulk.BulkRequest.validate(BulkRequest.java:265)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:55)
	at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:83)
	at org.elasticsearch.client.support.AbstractClient.bulk(AbstractClient.java:141)
	at org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(BulkRequestBuilder.java:128)
	at org.elasticsearch.action.support.BaseRequestBuilder.execute(BaseRequestBuilder.java:53)
	at org.elasticsearch.action.support.BaseRequestBuilder.execute(BaseRequestBuilder.java:47)
	at org.apache.nutch.indexer.elastic.ElasticWriter.processExecute(ElasticWriter.java:117)
	at org.apache.nutch.indexer.elastic.ElasticWriter.write(ElasticWriter.java:91)
	at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:45)
	at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:40)
	at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:111)
	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:61)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
{noformat}
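
The exception message above numbers one "type is missing" entry for each of the 500 documents in the bulk request, which reads as though every index request in that bulk was built without a type. The following is a minimal, self-contained sketch of that kind of per-request check. It is a simplified stand-in written for illustration, not Elasticsearch or Nutch source code: the class {{BulkValidationSketch}}, its nested {{IndexRequest}}, and the index name "webpages" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for illustration only; not Elasticsearch source code.
class BulkValidationSketch {

    // Hypothetical minimal index request holding only the fields the check needs.
    static final class IndexRequest {
        final String index;
        final String type; // null models a request that was built without a type
        final String id;

        IndexRequest(String index, String type, String id) {
            this.index = index;
            this.type = type;
            this.id = id;
        }
    }

    // Records one error per request that lacks a type, then numbers the errors
    // in the same "N: message;" style that appears in the log.
    // Returns null when every request in the bulk has a type.
    static String validate(List<IndexRequest> bulk) {
        List<String> errors = new ArrayList<>();
        for (IndexRequest request : bulk) {
            if (request.type == null) {
                errors.add("type is missing");
            }
        }
        if (errors.isEmpty()) {
            return null;
        }
        StringBuilder message = new StringBuilder("Validation Failed: ");
        for (int i = 0; i < errors.size(); i++) {
            message.append(i + 1).append(": ").append(errors.get(i)).append(';');
        }
        return message.toString();
    }

    public static void main(String[] args) {
        List<IndexRequest> bulk = new ArrayList<>();
        bulk.add(new IndexRequest("webpages", null, "doc-1")); // "webpages" is a made-up index name
        bulk.add(new IndexRequest("webpages", null, "doc-2"));
        System.out.println(validate(bulk));
    }
}
```

Run as written, the sketch prints {{Validation Failed: 1: type is missing;2: type is missing;}}, the same shape as the message in the log. If this reading is right, the thing to track down is where the type for each document is supposed to come from (job code or configuration), which the log alone does not show.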

*This is what I see when I start ElasticSearch:*
{noformat}
elasticsearch -f
[2012-08-31 06:31:56,832][INFO ][node                     ] [Doorman] {0.19.4}[53351]: initializing
...
[2012-08-31 06:31:56,841][INFO ][plugins                  ] [Doorman] loaded [MockSolrPlugin], sites []
[2012-08-31 06:31:57,752][INFO ][node                     ] [Doorman] {0.19.4}[53351]: initialized
[2012-08-31 06:31:57,752][INFO ][node                     ] [Doorman] {0.19.4}[53351]: starting
...
[2012-08-31 06:31:57,812][INFO ][transport                ] [Doorman] bound_address {inet[/0.0.0.0:9301]}, publish_address {inet[/192.168.1.133:9301]}
[2012-08-31 06:32:00,898][INFO ][cluster.service          ] [Doorman] detected_master [Doppleganger][OF5TWSbpTl64qA0_VW-b_g][inet[/192.168.1.133:9300]], added {[Doppleganger][OF5TWSbpTl64qA0_VW-b_g][inet[/192.168.1.133:9300]],}, reason: zen-disco-receive(from master [[Doppleganger][OF5TWSbpTl64qA0_VW-b_g][inet[/192.168.1.133:9300]]])
[2012-08-31 06:32:00,911][INFO ][discovery                ] [Doorman] elasticsearch_matt/YcpHmZWfSdCgvZbg7YfA3g
[2012-08-31 06:32:00,914][INFO ][http                     ] [Doorman] bound_address {inet[/0.0.0.0:9201]}, publish_address {inet[/192.168.1.133:9201]}
[2012-08-31 06:32:00,914][INFO ][node                     ] [Doorman] {0.19.4}[53351]: started
{noformat}
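
It may also be worth comparing the cluster names in the two logs. The sketch below assumes that the discovery INFO line ends with {{<cluster name>/<node id>}} after the bracketed node name; that reading of the log format is an assumption, and the class {{ClusterNameCheck}} and its methods are hypothetical helpers written only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing the two discovery log lines; illustration only.
class ClusterNameCheck {

    // Assumed tail of a discovery INFO line: "[node name] <cluster name>/<node id>".
    private static final Pattern DISCOVERY_TAIL =
            Pattern.compile("\\[([^\\]]+)\\]\\s+([^/\\s]+)/(\\S+)\\s*$");

    // Returns the cluster name if the line matches the assumed shape, otherwise null.
    static String clusterNameOf(String logLine) {
        Matcher matcher = DISCOVERY_TAIL.matcher(logLine);
        return matcher.find() ? matcher.group(2) : null;
    }

    // True only when both lines match the assumed shape and name the same cluster.
    static boolean sameCluster(String firstLine, String secondLine) {
        String first = clusterNameOf(firstLine);
        String second = clusterNameOf(secondLine);
        return first != null && first.equals(second);
    }

    public static void main(String[] args) {
        String embeddedNode = "2012-08-31 06:44:41,581 INFO  elasticsearch.discovery - "
                + "[Mother Night] Doppleganger/2IUXHWKhQfGsBhmPiozyqg";
        String server = "[2012-08-31 06:32:00,911][INFO ][discovery                ] "
                + "[Doorman] elasticsearch_matt/YcpHmZWfSdCgvZbg7YfA3g";
        System.out.println("embedded node cluster: " + clusterNameOf(embeddedNode));
        System.out.println("server cluster:        " + clusterNameOf(server));
        System.out.println("same cluster:          " + sameCluster(embeddedNode, server));
    }
}
```

Under that assumption, the node started by the Nutch job reports "Doppleganger" while the standalone server reports "elasticsearch_matt", and "Doppleganger" also appears as the name of the master node that the server detected. The "waited for 30s and no initial state was set by the discovery" warning would be consistent with the job's node not finding the running cluster, though this does not by itself explain the "type is missing" failure.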

Thanks,
Matt
                
> Add ElasticIndexerJob that indexes to elasticsearch
> ---------------------------------------------------
>
>                 Key: NUTCH-1445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1445
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1445-addPropsToConfig.patch, NUTCH-1445-addToNutchScript.patch,
NUTCH-1445.patch
>
>
> We have created a new indexer job, ElasticIndexerJob, that indexes to elasticsearch. It
is originally based upon https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2
license), but we have modified it greatly to integrate it as well as possible into Nutch.
The greatest modification is that documents are asynchronously flushed in bulk to elasticsearch.
> Elasticsearch rocks. Both performance and ease of configuration are awesome. You simply
deploy a server by unpacking the tar, configure the clustername, start the server and fire
away indexing requests. Indices are automatically created. Fields are automapped. (Of course
it is recommended to create your own optimized mapping, but that is beyond the scope of this issue.)
Multiple servers connect without extra configuration, simply by using the same clustername
(by means of multicast). There are tons of advanced options, such as sharding, replication,
disk striping etc.
> To give an example of the performance: With 20+ nodes we are able to index over 1M docs
(average-sized web documents) per minute. The best part is that the added documents are almost
instantly searchable, so there are none of the hidden commit costs that Solr has. This is with out-of-the-box
configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira