nutch-dev mailing list archives

From "Tejas Patil (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1047) Pluggable indexing backends
Date Sun, 27 Jan 2013 12:29:22 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563793#comment-13563793 ]

Tejas Patil commented on NUTCH-1047:
------------------------------------

Hi Julien,
I am trying out the patch and have run into an issue. Maybe I am using it the wrong way. Here is
what I did:
After setting up Nutch + Solr and changing schema.xml as per the [wiki|http://wiki.apache.org/nutch/NutchTutorial],
I applied the patch. If I don't pass the -D option in the crawl command, it throws an exception
indicating _"Missing SOLR URL"_. I believe that the _-solr_ option, along with the URL, also needs
to be provided, or else it won't perform the indexing part. To run a test crawl, I use this command:
{noformat}bin/nutch crawl -D solr.server.url=http://localhost:8983/solr/ urls -solr http://localhost:8983/solr/ -depth 5 -topN 5000{noformat}

It fails with an exception saying: _"ERROR: [doc=http://searchhub.org/2009/03/09/nutch-solr/] unknown field 'content'"_.
I do not know what causes this. Could you kindly point out where I went wrong?
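
One way to narrow down an "unknown field" error like the one above is to check whether the schema.xml that Solr is actually loading declares that field. The following is a minimal sketch, not part of the patch or of Nutch: the helper names and the sample schema are hypothetical, and it only looks at explicit {{<field>}} declarations, so dynamic fields are not considered.

```python
import xml.etree.ElementTree as ET


def declared_fields(schema_xml):
    """Return the names of all explicit <field> elements in a Solr schema."""
    root = ET.fromstring(schema_xml)
    # iter("field") matches only <field>, not <dynamicField> or <fieldType>.
    return {f.get("name") for f in root.iter("field") if f.get("name")}


def missing_fields(schema_xml, required):
    """Return the required field names that the schema does not declare."""
    declared = declared_fields(schema_xml)
    return [name for name in required if name not in declared]


# Hypothetical, deliberately incomplete schema used only for illustration.
SAMPLE_SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
  </fields>
</schema>"""

if __name__ == "__main__":
    # Fields the indexer is assumed to send; adjust to the real document.
    print(missing_fields(SAMPLE_SCHEMA, ["id", "title", "content"]))
```

To check a real installation, the contents of the schema.xml in the Solr core's conf directory can be read from disk and passed in place of the sample string. If 'content' shows up as missing, Solr is probably not using the schema that was edited.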

Also, the crawl command above requires the Solr URL to be specified twice. Is there a way to
run it with the URL specified just once?
                
> Pluggable indexing backends
> ---------------------------
>
>                 Key: NUTCH-1047
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1047
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>              Labels: indexing
>             Fix For: 1.7
>
>         Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>
> One possible feature would be to add a new endpoint for indexing backends and make the
> indexing pluggable. At the moment we are hardwired to Solr, which is OK, but as other resources
> like ElasticSearch are becoming more popular it would be better to handle this as plugins.
> Not sure about the name of the endpoint though: we already have indexing-plugins (which are
> about generating fields sent to the backends), and moreover the backends are not necessarily
> for indexing / searching but could be just an external storage, e.g. CouchDB. The term backend
> on its own would be confusing in 2.0, as this could be pertaining to the storage in GORA. 'indexing-backend'
> is the best name that came to my mind so far; please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning,
> and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and
> deduplicating for various backends.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
