nutch-dev mailing list archives

From "Michael Joyce (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
Date Wed, 15 Apr 2015 15:53:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426 ]

Michael Joyce commented on NUTCH-1987:
--------------------------------------

Hi folks,

I'll have a patch up for this shortly. To keep the number of changes in a single patch
small, my current plan is to:

* Add solr.server.url to nutch-default and set the value to some sane default (http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change the call format.
* Update some variable names and echo messages in the bin/crawl script so that they no
longer mention only Solr, which could confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}
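
The two call forms above could be told apart by argument count. As a rough sketch only: the function name `parse_crawl_args`, the variable names, and the usage text below are hypothetical and are not taken from the actual bin/crawl script. The sketch assumes the indexing toggle is an optional third positional argument and that any value in that position enables indexing.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of argument handling; NOT the real bin/crawl code.
# Assumption: four arguments means "run the indexer", three means "skip it".

parse_crawl_args() {
  # Sets SEEDDIR, CRAWL_PATH, RUN_INDEXER and LIMIT from the positional arguments.
  if [ "$#" -eq 4 ]; then
    # Form: <seedDir> <crawlDir> <toggle> <rounds>
    # The third argument only acts as a toggle here; its value is ignored.
    SEEDDIR="$1"
    CRAWL_PATH="$2"
    RUN_INDEXER=true
    LIMIT="$4"
  elif [ "$#" -eq 3 ]; then
    # Form: <seedDir> <crawlDir> <rounds>
    SEEDDIR="$1"
    CRAWL_PATH="$2"
    RUN_INDEXER=false
    LIMIT="$3"
  else
    # Hypothetical usage message, written for this sketch only.
    echo "Usage: <seedDir> <crawlDir> [run_indexer] <numberOfRounds>" >&2
    return 1
  fi
}

# Example: mirrors the "Run the indexer" call shown above.
parse_crawl_args urls/ crawl/ "run_indexer" 1
echo "index=$RUN_INDEXER rounds=$LIMIT"
```

With this shape, the indexer itself would be configured entirely through the conf files, and the script would only decide whether the indexing step runs.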

I don't think this is necessarily the ideal solution, but it minimizes changes to the
calling format for people with existing setups and only requires that a single configuration
value be added or updated. Note that this change will require a number of documentation
updates. I'm more than happy to help with those as well, but I'm not including them in this ticket.

Thoughts?

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>             Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance,
> when I want to use the indexer-elastic plugin I still need to call the crawl script with
> a fake Solr URL; otherwise it will skip the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files (to mirror
> the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle
> whether indexing does or doesn't occur instead of also trying to configure the indexer at
> the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)