nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1734) Make SolrIndexWriter more intelligent
Date Sun, 16 Mar 2014 13:18:42 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937137#comment-13937137 ]

Lewis John McGibbney commented on NUTCH-1734:
---------------------------------------------

Excellent to see you log this issue [~lajos@protulae.com]. We can keep discussion of the issue
here.

> Make SolrIndexWriter more intelligent
> -------------------------------------
>
>                 Key: NUTCH-1734
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1734
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.7, 2.2.1
>            Reporter: Lajos Moczar
>            Priority: Minor
>
> The current mapping of the NutchDocument to SolrDocument is based on the fields in the
former, which can cause problems when you are using an existing Solr schema:
> 1) the existing logic requires Solr to support all Nutch fields, which might not be the
case (like segment).
> 2) you can map a Nutch field to at most 2 Solr fields (i.e. one via a <field> tag and
one via a <copy> tag), because the source attribute is the Map key and therefore you can
have only one entry per source.
> Additionally, it would be nice to support some level of transformations, literals, etc.,
like those used in the Solr DataImportHandler (DIH).
> I propose to make the code more intelligent so that, while still supporting the existing
"strict" mapping that people are used to, it allows more flexible and intelligent mapping.
It will also include a transformation architecture that can be expanded over time.
> The general approach is to reverse the building of the SolrDocument, and populate the
doc based on the Solr destination fields as defined in solrindex-mapping.xml, i.e., it populates
the doc based on what the target Solr wants to receive, not just what Nutch wants to send.
The Map of fields in solrindex-mapping.xml will be keyed by dest, i.e. the Solr field name,
not source. That way one can map a source to multiple destinations. A mapping type attribute
(defaults to just a simple copy from Nutch to Solr) will support literals and transformations.
> Note that a default "strict" mapping (i.e. the Solr schema by default MUST support all
NutchDocument fields) will be supported for backwards compatibility. I assume this will be
what people want.
> I will submit patches in due course.
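
To make the proposed "keyed by dest" approach concrete, here is a minimal sketch in plain Java. It is not the patch and does not use the real Nutch or SolrJ classes: the class name DestKeyedMappingSketch, the Type enum, the Rule class and the LOWERCASE transformation are all hypothetical, and plain Maps stand in for NutchDocument and SolrDocument. It only illustrates that keying the rules by Solr field name lets one Nutch source feed several destinations, and that unmapped Nutch fields (such as segment) are simply not sent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DestKeyedMappingSketch {

    /** Hypothetical mapping types; COPY stands in for the default simple copy. */
    enum Type { COPY, LITERAL, LOWERCASE }

    /** One hypothetical rule describing how to fill a single Solr destination field. */
    static final class Rule {
        final Type type;
        final String source;  // Nutch field name (unused for LITERAL)
        final String literal; // fixed value (used only for LITERAL)

        Rule(Type type, String source, String literal) {
            this.type = type;
            this.source = source;
            this.literal = literal;
        }
    }

    /**
     * Builds the outgoing document by iterating over the destination fields,
     * i.e. over what Solr wants to receive, not over what Nutch has.
     */
    static Map<String, List<Object>> buildSolrDoc(Map<String, List<Object>> nutchDoc,
                                                  Map<String, Rule> rulesByDest) {
        Map<String, List<Object>> solrDoc = new LinkedHashMap<>();
        for (Map.Entry<String, Rule> entry : rulesByDest.entrySet()) {
            Rule rule = entry.getValue();
            List<Object> values = new ArrayList<>();
            if (rule.type == Type.LITERAL) {
                values.add(rule.literal);
            } else {
                List<Object> sourceValues = nutchDoc.get(rule.source);
                if (sourceValues == null) {
                    continue; // source field absent: skip this destination
                }
                for (Object value : sourceValues) {
                    values.add(rule.type == Type.LOWERCASE
                            ? String.valueOf(value).toLowerCase(Locale.ROOT)
                            : value);
                }
            }
            solrDoc.put(entry.getKey(), values);
        }
        return solrDoc;
    }

    public static void main(String[] args) {
        // Stand-in for a NutchDocument: field name -> values.
        Map<String, List<Object>> nutchDoc = new LinkedHashMap<>();
        nutchDoc.put("title", Arrays.<Object>asList("Hello World"));
        nutchDoc.put("segment", Arrays.<Object>asList("20140316131842"));

        // Rules keyed by destination: "title" feeds two Solr fields.
        Map<String, Rule> rules = new LinkedHashMap<>();
        rules.put("title", new Rule(Type.COPY, "title", null));
        rules.put("title_lc", new Rule(Type.LOWERCASE, "title", null));
        rules.put("origin", new Rule(Type.LITERAL, null, "nutch"));

        // Prints {title=[Hello World], title_lc=[hello world], origin=[nutch]}
        System.out.println(buildSolrDoc(nutchDoc, rules));
    }
}
```

The backwards-compatible "strict" mode mentioned in the issue is not shown here; in this sketch it would amount to generating one COPY rule per Nutch field.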



--
This message was sent by Atlassian JIRA
(v6.2#6252)
