nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
Date Wed, 15 Apr 2015 21:02:58 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496986#comment-14496986 ]

Sebastian Nagel commented on NUTCH-1987:
----------------------------------------

Agreed: it's time to drop the Solr URL argument, because we support alternative indexing
back-ends. And it's good to add a default Solr URL to nutch-default.xml and document the
property this way.
Whether or not to run the indexer is an option. Instead of still relying on a magic positional
parameter, wouldn't it be more natural to control this with command-line options:
{code:none}
# -i  index crawled content
# -D  <property=value>  passed to Nutch commands/tools
bin/crawl -i -D solr.server.url=http://.../solr/  urls/ crawl/ 3
# equivalent if solr.server.url is default or defined in nutch-site.xml:
bin/crawl -i urls/ crawl/ 3
# it does no harm to keep this for backward compatibility:
bin/crawl urls/ crawl/ http://.../solr/ 3
{code}
This would make the options extensible and allow new ones to be added, e.g., to enable or
disable link inversion or webgraph creation.
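
As a rough illustration of the option handling proposed above, here is a minimal sketch in
plain sh. It is not the actual bin/crawl code: the function name parse_crawl_args and the
variables INDEXFLAG, NUTCH_PROPS, SEEDDIR, CRAWL_PATH and LIMIT are hypothetical, and the
rule "four positional arguments means the old form with a Solr URL" is only one assumed way
to keep backward compatibility.

```shell
# Hypothetical sketch of option parsing for bin/crawl (not the real script).
# Results are left in the variables INDEXFLAG, NUTCH_PROPS, SEEDDIR,
# CRAWL_PATH and LIMIT, all of which are illustrative names.
parse_crawl_args() {
  INDEXFLAG=false   # becomes true when -i is given
  NUTCH_PROPS=""    # accumulates "-D property=value" pairs
  SEEDDIR=""
  CRAWL_PATH=""
  LIMIT=""

  # Consume leading options; stop at the first positional argument.
  while [ $# -gt 0 ]; do
    case "$1" in
      -i)
        INDEXFLAG=true
        shift
        ;;
      -D)
        if [ $# -lt 2 ]; then
          echo "-D requires property=value" >&2
          return 1
        fi
        # Append, inserting a separating space only if something is already there.
        NUTCH_PROPS="${NUTCH_PROPS:+$NUTCH_PROPS }-D $2"
        shift 2
        ;;
      -*)
        echo "unknown option: $1" >&2
        return 1
        ;;
      *)
        break
        ;;
    esac
  done

  if [ $# -eq 4 ]; then
    # Assumed legacy form: <seedDir> <crawlDir> <solrUrl> <numberOfRounds>
    SEEDDIR=$1
    CRAWL_PATH=$2
    LIMIT=$4
    INDEXFLAG=true
    NUTCH_PROPS="${NUTCH_PROPS:+$NUTCH_PROPS }-D solr.server.url=$3"
  elif [ $# -eq 3 ]; then
    # Proposed form: <seedDir> <crawlDir> <numberOfRounds>
    SEEDDIR=$1
    CRAWL_PATH=$2
    LIMIT=$3
  else
    echo "Usage: crawl [-i] [-D property=value] <seedDir> <crawlDir> <numberOfRounds>" >&2
    return 1
  fi
  return 0
}
```

In this sketch the old four-argument call simply turns into "-i" plus a solr.server.url
property, so the rest of the script would only ever need to look at INDEXFLAG and
NUTCH_PROPS. Further switches (for link inversion or webgraph creation, say) would be
additional cases in the same loop.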

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>             Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance,
> when I want to use the indexer-elastic plugin I still need to call the crawler script with
> a fake Solr URL, otherwise it will skip the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files (to mirror
> the elastic search indexer conf and others) and to make the indexing parameter simply toggle
> whether indexing does or doesn't occur instead of also trying to configure the indexer at
> the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
