nutch-dev mailing list archives

From "Aaron Bedward (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
Date Wed, 25 Jun 2014 14:45:25 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043555#comment-14043555
] 

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

OK, I chose this version because it already had Elasticsearch integration and I was hoping
to extract meta tags out of the box.

I will try debugging and report back with my progress.

> Unable to get any documents to index in elastic search
> ------------------------------------------------------
>
>                 Key: NUTCH-1798
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1798
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>         Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
>            Reporter: Aaron Bedward
>             Fix For: 2.3
>
>         Attachments: part-r-00000
>
>
> Hopefully this is something I am doing wrong. I have checked out 2.x as I would like
> to use the new metatag extraction features. I then ran ant runtime to build and
> updated nutch-site.xml like so:
> <property>
>   <name>plugin.includes</name>
>  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>  <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable 
>   protocol-httpclient, but be aware of possible intermittent problems with the 
>   underlying commons-httpclient library.
>   </description>
> </property>
>   <property>
>       <name>elastic.cluster</name>
>       <value>elasticsearch</value>
>       <description>The cluster name to discover. Either host and port must be defined
>         or cluster.</description>
>   </property>
>  
> I then created a folder called urls and added seed.txt.
> I ran the following commands:
> bin/nutch inject urls
> bin/nutch generate -topN 1000  
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb
> bin/nutch index  -all 
> It runs with no errors; however, no documents have been indexed.
> I also tried the same setup with Solr, and no documents were indexed there either.
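The command sequence above could be wrapped so that a failing step is reported explicitly instead of relying on the absence of error output. This is only a sketch: `NUTCH`, `run_step`, and `crawl_once` are hypothetical names, and the commands and arguments are copied from the list above without checking whether this Nutch version expects anything further.

```shell
#!/bin/sh
# Path to the Nutch launcher; assumed to be run from the runtime directory.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical helper: echo the command, run it, and report a non-zero exit.
run_step() {
  echo ">>> $*"
  "$@" || { echo "step failed: $*" >&2; return 1; }
}

# Runs the same steps as in the report, stopping at the first failure.
crawl_once() {
  run_step "$NUTCH" inject urls &&
  run_step "$NUTCH" generate -topN 1000 &&
  run_step "$NUTCH" fetch -all &&
  run_step "$NUTCH" parse -all &&
  run_step "$NUTCH" updatedb &&
  run_step "$NUTCH" index -all
}
```

Calling `crawl_once` runs one cycle; a step that exits non-zero stops the chain and is named on stderr. A step that exits zero but processes nothing would still pass, so this does not replace reading the logs.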
> Log:
> 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
> 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57,
time elapsed: 00:00:06
> 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
> 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title length for indexing
set to: 100
> 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is:
off
> 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
> 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], pid[21885],
build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
> 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], sites []
> 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
> 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
> 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],
added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from
master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
> 2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]},
publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
> 2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title length for indexing
set to: 100
> 2014-06-24 02:58:11,569 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:11,581 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is:
off
> 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:11,716 INFO  elastic.ElasticIndexWriter - Processing remaining requests
[docs = 0, length = 0, total docs = 0]
> 2014-06-24 02:58:11,717 INFO  elastic.ElasticIndexWriter - Processing to finalize last
execute
> 2014-06-24 02:58:11,717 INFO  elasticsearch.node - [Silver] stopping ...
> 2014-06-24 02:58:11,751 INFO  elasticsearch.node - [Silver] stopped
> 2014-06-24 02:58:11,751 INFO  elasticsearch.node - [Silver] closing ...
> 2014-06-24 02:58:11,756 INFO  elasticsearch.node - [Silver] closed
> 2014-06-24 02:58:11,759 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2014-06-24 02:58:12,511 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:12,511 INFO  indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
> 	elastic.cluster : elastic prefix cluster
> 	elastic.host : hostname
> 	elastic.port : port  (default 9300)
> 	elastic.index : elastic index command 
> 	elastic.max.bulk.docs : elastic bulk index doc counts. (default 250) 
> 	elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> 2014-06-24 02:58:12,525 INFO  elasticsearch.node - [Lifeguard] version[1.1.0], pid[21885],
build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:12,525 INFO  elasticsearch.node - [Lifeguard] initializing ...
> 2014-06-24 02:58:12,555 INFO  elasticsearch.plugins - [Lifeguard] loaded [], sites []
> 2014-06-24 02:58:13,025 INFO  elasticsearch.node - [Lifeguard] initialized
> 2014-06-24 02:58:13,025 INFO  elasticsearch.node - [Lifeguard] starting ...
> 2014-06-24 02:58:13,032 INFO  elasticsearch.transport - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9301]},
publish_address {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:16,063 INFO  cluster.service - [Lifeguard] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],
added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from
master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:16,072 INFO  elasticsearch.discovery - [Lifeguard] elasticsearch/MWiqtTiqS5aC_M7QvGtfyg
> 2014-06-24 02:58:16,074 INFO  elasticsearch.http - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9201]},
publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:16,076 INFO  elasticsearch.node - [Lifeguard] started
> 2014-06-24 02:58:16,076 INFO  indexer.IndexingJob - IndexingJob: done.
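The log above reports `total docs = 0` from ElasticIndexWriter, so one way to cross-check from the Elasticsearch side is to ask the index for its document count. The sketch below is an assumption-laden example, not something taken from the report: the HTTP address `http://localhost:9200` and the index name `nutch` are guesses that should be replaced with the real values, and `extract_count` and `check_index` are hypothetical helper names.

```shell
#!/bin/sh
# Assumed values; neither the HTTP address nor the index name appears in the report.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_INDEX="${ES_INDEX:-nutch}"

# Hypothetical helper: pull the numeric "count" field out of a _count JSON response.
extract_count() {
  printf '%s' "$1" | sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Query the index's _count endpoint and print the number of documents found.
check_index() {
  response=$(curl -s "$ES_URL/$ES_INDEX/_count") || return 1
  count=$(extract_count "$response")
  echo "documents in $ES_INDEX: ${count:-unknown}"
}
```

Running `check_index` after `bin/nutch index -all` would show whether anything reached the index; "unknown" means the response had no count field, for example because the index does not exist.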



--
This message was sent by Atlassian JIRA
(v6.2#6252)
