nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
Date Fri, 17 Nov 2017 14:17:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257012#comment-16257012
] 

ASF GitHub Bot commented on NUTCH-1480:
---------------------------------------

sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-345254791
 
 
   Hi @r0ann3l, thanks! I've continued testing and was able to feed two Solr indexes in parallel.
Great! As far as I can see, all requested changes have been made (including the one requested by @lewismc).
   
   To make the configuration work out of the box, I would suggest 3 changes:
   - use only field names defined in the default schema.xml
   `ERROR: [doc=http://nutch.apache.org/] unknown field 'search'`
   - default Solr core name should be "nutch" as described in the [tutorial](https://wiki.apache.org/nutch/NutchTutorial)
   
   I've tried to fix these issues in [a fork of NUTCH-1480](https://github.com/sebastian-nagel/nutch/commits/NUTCH-1480).
Feel free to cherry-pick the fixes from there.
   
   I've also tried to get indexer-dummy working, but without success. The file is created but then
overwritten:
   
   - there are two instances of `IndexWriters` active, each with its own instance of
DummyIndexWriter.
      - the instance created from `IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)`
writes into the file
      - later on, the instance created from `IndexWriters.open(IndexWriters.java:187)`
opens the file anew, so an empty file is left at the end. Because these are two separate
instances, neither can check whether the file writer has already been instantiated.
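
The overwrite described above can be reproduced outside Nutch. The following is only a minimal sketch, not the actual DummyIndexWriter code: `FileWriterSketch` and `TwoWritersDemo` are hypothetical names, and it assumes the writer opens its file in truncating mode.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TwoWritersDemo {

  /** Hypothetical stand-in for a file-based index writer (not the Nutch class). */
  static class FileWriterSketch {
    private Writer out;

    void open(Path file) throws IOException {
      // Without explicit options, Files.newBufferedWriter creates the file
      // or truncates an existing one, so earlier content is discarded.
      out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
    }

    void write(String doc) throws IOException {
      out.write(doc + "\n");
    }

    void close() throws IOException {
      out.close();
    }
  }

  /** Runs both instances against the same file and returns the final file size. */
  static long run(Path file) throws IOException {
    // First instance: writes a document and closes the file.
    FileWriterSketch first = new FileWriterSketch();
    first.open(file);
    first.write("add\thttp://nutch.apache.org/");
    first.close();

    // Second, independent instance: opens the same file again and writes nothing.
    FileWriterSketch second = new FileWriterSketch();
    second.open(file);
    second.close();

    return Files.size(file);
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("dummy-index", ".txt");
    System.out.println("size after second open: " + run(file));
    Files.delete(file);
  }
}
```

Since the second instance shares no state with the first, it cannot tell that the file was already written, and the file ends up with size 0.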
   
   I see two potential solutions:
   1. the IndexWriter interface method `open(job, name)` was defined with file indexers in
mind (cf. NUTCH-1541/[CSVIndexWriter](https://github.com/sebastian-nagel/nutch/blob/NUTCH-1541/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java#L233)),
so an index writer can decide to do nothing when called with the name "commit".
   2. do not call the `commit()` method explicitly, and possibly also remove it from the interface:
it does not work safely in distributed mode because it's not run in the reducers (see the
comment in RabbitIndexWriter).
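
To illustrate the first option, here is a rough sketch of a writer that skips opening its output when called with the name "commit". The interface is a simplified, hypothetical stand-in for `IndexWriter` (the `job` argument is omitted), an in-memory list plays the role of the file, and the name "part-00000" is only an assumed example of a non-commit name.

```java
import java.util.ArrayList;
import java.util.List;

class CommitAwareWriterSketch {

  /** Simplified, hypothetical stand-in for the IndexWriter interface. */
  interface SimpleIndexWriter {
    void open(String name);
    void write(String doc);
    void close();
  }

  /** File-like writer; the shared list stands in for the output file. */
  static class FileLikeWriter implements SimpleIndexWriter {
    private final List<String> target;
    private boolean opened = false;

    FileLikeWriter(List<String> target) {
      this.target = target;
    }

    @Override
    public void open(String name) {
      if ("commit".equals(name)) {
        return; // commit-only instance: leave existing output untouched
      }
      target.clear(); // emulate truncation when the file is opened
      opened = true;
    }

    @Override
    public void write(String doc) {
      if (!opened) {
        throw new IllegalStateException("writer is not open");
      }
      target.add(doc);
    }

    @Override
    public void close() {
      opened = false;
    }
  }

  /** Runs a writing instance followed by a commit-only instance. */
  static List<String> run() {
    List<String> file = new ArrayList<>();

    SimpleIndexWriter writing = new FileLikeWriter(file);
    writing.open("part-00000"); // assumed example name
    writing.write("http://nutch.apache.org/");
    writing.close();

    SimpleIndexWriter committing = new FileLikeWriter(file);
    committing.open("commit"); // does nothing, so the content survives
    committing.close();

    return file;
  }

  public static void main(String[] args) {
    System.out.println("documents kept: " + run().size());
  }
}
```

This keeps the output intact, but it relies on every file-based writer remembering the special name, which is one argument for the second option.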
   
   I lean toward the second solution. It would also solve the problem of having two `IndexWriters`
instances active. What do you think?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> SolrIndexer to write to multiple servers.
> -----------------------------------------
>
>                 Key: NUTCH-1480
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1480
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1480-1.6.1.patch, adding-support-for-sharding-indexer-for-solr.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a comma-delimited
> list of URLs using Configuration.getString(). SolrWriter should be able to handle this list
> of SolrServers.
> This is useful if you want to send documents to multiple servers if no replication is
> available or if you want to send documents to multiple NOCs.
> edit:
> This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this issue allows
> you to index to multiple SolrCloud clusters at the same time.
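
The idea in the issue description, splitting a comma-delimited URL list and sending each document to every server, might look roughly like the sketch below. It uses no Solr or Hadoop classes: `RecordingServer` and `FanOutWriter` are hypothetical stand-ins, and the host names in `main` are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class MultiServerSketch {

  /** Splits a comma-delimited property value into trimmed, non-empty URLs. */
  static List<String> parseUrls(String value) {
    if (value == null) {
      return Collections.emptyList();
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(url -> !url.isEmpty())
        .collect(Collectors.toList());
  }

  /** Hypothetical stand-in for one server connection; it only records documents. */
  static class RecordingServer {
    final String url;
    final List<String> received = new ArrayList<>();

    RecordingServer(String url) {
      this.url = url;
    }

    void add(String doc) {
      received.add(doc);
    }
  }

  /** Sends every document to every configured server. */
  static class FanOutWriter {
    final List<RecordingServer> servers = new ArrayList<>();

    FanOutWriter(List<String> urls) {
      for (String url : urls) {
        servers.add(new RecordingServer(url));
      }
    }

    void write(String doc) {
      for (RecordingServer server : servers) {
        server.add(doc);
      }
    }
  }

  public static void main(String[] args) {
    // Placeholder hosts; the core name "nutch" follows the tutorial default.
    FanOutWriter writer = new FanOutWriter(
        parseUrls("http://host-a:8983/solr/nutch, http://host-b:8983/solr/nutch"));
    writer.write("http://nutch.apache.org/");
    for (RecordingServer server : writer.servers) {
      System.out.println(server.url + " received " + server.received.size() + " document(s)");
    }
  }
}
```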



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
