lucene-solr-user mailing list archives

From Uri Boness <ubon...@gmail.com>
Subject Re: solr nutch url indexing
Date Mon, 24 Aug 2009 20:42:15 GMT
How did you configure Nutch?

Make sure you have the "parse-html" and "index-basic" plugins configured. 
By default, the HtmlParser extracts the page title and adds it to the 
parsed data, and the BasicIndexingFilter adds this title to the 
NutchDocument and stores it in the "title" field. All the SolrIndex 
(actually the SolrWriter) does is convert the NutchDocument to a 
SolrInputDocument. So having these plugins configured in Nutch and 
having a field named "title" in the schema should work. (I'm assuming 
you're using the "solrindex" tool.)
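
As a rough sketch of the setup described above (not taken from your 
configuration): the first fragment would go in conf/nutch-site.xml and the 
second in Solr's schema.xml. The exact plugin list and the "text" field 
type are assumptions; keep whatever other plugins your crawl needs and use 
a field type that your schema actually defines.

```xml
<!-- conf/nutch-site.xml (fragment): the plugin.includes regex must match
     both parse-html and index-basic. The other plugins listed here are
     only an illustrative selection. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- Solr schema.xml (fragment, inside <fields>): a field named "title".
     type="text" assumes a fieldType called "text" exists in your schema. -->
<field name="title" type="text" stored="true" indexed="true"/>
```

After changing either file you would need to re-run the crawl/index step 
and "solrindex" so that the title ends up in the Solr documents.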

Cheers,
Uri

Lassalle, Thibaut wrote:
> Hi,
>
> I would like to crawl intranets with nutch and index them with solr.
>
> I would like to search mostly on the title of the pages (the one in
> <title>This is a title</title>)
>
> I tried to tweak the schema.xml to do that but nothing is working. I
> just have the content indexed.
>
> How do I index on title ?
>
> Thanks
>
> t.
