nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
Date Wed, 25 Jun 2014 14:37:25 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043546#comment-14043546 ]

Julien Nioche commented on NUTCH-1798:
--------------------------------------

No problem Aaron! OK, so it looks like you do have documents in the table that were successfully
fetched. Unfortunately, 2.x lacks many of the features that 1.x has (not to mention
robustness) and that are useful for testing, e.g. indexer-dummy or [NUTCH-1758]. If you have
good reasons to use 2.x and not 1.x, the best approach would be either to port these two patches
to 2.x or to debug in local mode to see what's happening.

See [http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse_.28NOT_VERIFIED.29]
(no idea who thought it was not verified, it should work fine) for advice on how to debug.

> Unable to get any documents to index in elastic search
> ------------------------------------------------------
>
>                 Key: NUTCH-1798
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1798
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>         Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
>            Reporter: Aaron Bedward
>             Fix For: 2.3
>
>         Attachments: part-r-00000
>
>
> Hopefully this is something I am doing wrong. I have checked out 2.x as I would like to use the new metatag extraction features. I then ran ant runtime to build and updated nutch-site.xml like so:
> <property>
>   <name>plugin.includes</name>
>  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>  <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable 
>   protocol-httpclient, but be aware of possible intermittent problems with the 
>   underlying commons-httpclient library.
>   </description>
> </property>
>   <property>
>       <name>elastic.cluster</name>
>       <value>elasticsearch</value>
>       <description>The cluster name to discover. Either host and port must be defined
>         or cluster.</description>
>   </property>
>  
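One quick sanity check on the plugin.includes value above is to confirm that the plugin names you expect are actually selected by the expression. The following is a minimal sketch, assuming the expression must match the whole plugin directory name; that matching rule and the helper name is_included are assumptions for illustration, not Nutch's own code.

```python
import re

# Value of plugin.includes copied from the nutch-site.xml quoted above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika|metatags)|"
    r"index-(basic|anchor|more|metatags)|indexer-elasticsearch|"
    r"urlnormalizer-(pass|regex|basic)|scoring-opic"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if plugin_id matches the whole expression.

    Hypothetical helper: full-name matching is an assumption here.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    candidates = (
        "indexer-elasticsearch",
        "index-metatags",
        "parse-metatags",
        "indexer-solr",
        "protocol-httpclient",
    )
    for plugin_id in candidates:
        print(plugin_id, is_included(plugin_id))
```

Under that assumption, indexer-elasticsearch and the metatags plugins are selected, while indexer-solr is not, so a Solr test would need its own entry in the expression.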
> I then created a folder called urls and added seed.txt.
> I ran the following commands:
> bin/nutch inject urls
> bin/nutch generate -topN 1000  
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb
> bin/nutch index  -all 
> It runs with no errors; however, no documents have been indexed.
> I also tried the same setup with Solr, and no documents were indexed either.
> Log:
> 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
> 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
> 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
> 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
> 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
> 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], sites []
> 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
> 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
> 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
> 2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
> 2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
> 2014-06-24 02:58:11,569 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:11,581 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:11,716 INFO  elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
> 2014-06-24 02:58:11,717 INFO  elastic.ElasticIndexWriter - Processing to finalize last execute
> 2014-06-24 02:58:11,717 INFO  elasticsearch.node - [Silver] stopping ...
> 2014-06-24 02:58:11,751 INFO  elasticsearch.node - [Silver] stopped
> 2014-06-24 02:58:11,751 INFO  elasticsearch.node - [Silver] closing ...
> 2014-06-24 02:58:11,756 INFO  elasticsearch.node - [Silver] closed
> 2014-06-24 02:58:11,759 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2014-06-24 02:58:12,511 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:12,511 INFO  indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
> 	elastic.cluster : elastic prefix cluster
> 	elastic.host : hostname
> 	elastic.port : port  (default 9300)
> 	elastic.index : elastic index command 
> 	elastic.max.bulk.docs : elastic bulk index doc counts. (default 250) 
> 	elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> 2014-06-24 02:58:12,525 INFO  elasticsearch.node - [Lifeguard] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:12,525 INFO  elasticsearch.node - [Lifeguard] initializing ...
> 2014-06-24 02:58:12,555 INFO  elasticsearch.plugins - [Lifeguard] loaded [], sites []
> 2014-06-24 02:58:13,025 INFO  elasticsearch.node - [Lifeguard] initialized
> 2014-06-24 02:58:13,025 INFO  elasticsearch.node - [Lifeguard] starting ...
> 2014-06-24 02:58:13,032 INFO  elasticsearch.transport - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:16,063 INFO  cluster.service - [Lifeguard] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:16,072 INFO  elasticsearch.discovery - [Lifeguard] elasticsearch/MWiqtTiqS5aC_M7QvGtfyg
> 2014-06-24 02:58:16,074 INFO  elasticsearch.http - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:16,076 INFO  elasticsearch.node - [Lifeguard] started
> 2014-06-24 02:58:16,076 INFO  indexer.IndexingJob - IndexingJob: done.
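The key line in the log above is the ElasticIndexWriter summary, which reports "total docs = 0": the writer started and shut down cleanly but was never handed a document. When repeating the run while debugging, a small script can pull that count out of the log. This is a minimal sketch, assuming the summary line keeps the format shown above; the function name total_docs is hypothetical.

```python
import re
from typing import List

# Matches the ElasticIndexWriter summary line as it appears in the log above.
# \s+ tolerates the line being wrapped between "requests" and "[docs".
SUMMARY = re.compile(
    r"Processing remaining requests\s+"
    r"\[docs = (\d+), length = (\d+), total docs = (\d+)\]"
)


def total_docs(log_text: str) -> List[int]:
    """Return the 'total docs' value of every summary line found in log_text."""
    return [int(match.group(3)) for match in SUMMARY.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "2014-06-24 02:58:11,716 INFO  elastic.ElasticIndexWriter - "
        "Processing remaining requests [docs = 0, length = 0, total docs = 0]"
    )
    print(total_docs(sample))
```

A result containing only zeros means the problem lies upstream of the index writer (for example, in which rows the indexing job selects), not in the connection to Elasticsearch.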



--
This message was sent by Atlassian JIRA
(v6.2#6252)
