nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (NUTCH-716) Make subcollection index filed multivalued
Date Mon, 06 Sep 2010 13:34:33 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906488#action_12906488 ]

Markus Jelsma edited comment on NUTCH-716 at 9/6/10 9:32 AM:
-------------------------------------------------------------

This patch concatenates multiple values into a single string instead of adding each value
separately to a multivalued field. For a test crawl I have defined the following two subcollection definitions:

 <subcollection>
  <name>asdf</name>
  <id>asdf-site</id>
  <whitelist>http://asdf/</whitelist>
  <blacklist/>
 </subcollection>

 <subcollection>
  <name>news</name>
  <id>asdf-nieuws</id>
  <whitelist>http://asdf/news/</whitelist>
  <blacklist/>
 </subcollection>

Reindexing the segments by sending them to Solr yields the following results for the home
URL and a news URL:

<doc>
  <arr name="subcollection">
    <str>asdf</str>
  </arr>
  <str name="url">http://asdf/home/</str>
</doc>
<doc>
  <arr name="subcollection">
    <str>asdf news</str>
  </arr>
  <str name="url">http://asdf/news/</str>
</doc>

Instead, I expected the following result for the second document:

<doc>
  <arr name="subcollection">
    <str>asdf</str>
    <str>news</str>
  </arr>
  <str name="url">http://asdf/news/</str>
</doc>
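
To make the difference between the observed and the expected output concrete, here is a minimal Java sketch. It does not use Nutch or Solr classes: the document is modelled as a plain map from field name to a list of values, and the class and method names (SubcollectionFieldSketch, addConcatenated, addSeparately) are hypothetical, chosen only for illustration. It assumes the matching subcollection names are already available as a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubcollectionFieldSketch {

    // Hypothetical stand-in for an index document: field name -> list of values.
    public static Map<String, List<String>> newDoc() {
        return new LinkedHashMap<>();
    }

    // Observed behaviour: all matching names are joined into ONE field value.
    public static void addConcatenated(Map<String, List<String>> doc,
                                       String field, List<String> names) {
        doc.computeIfAbsent(field, k -> new ArrayList<>())
           .add(String.join(" ", names));
    }

    // Expected behaviour: each matching name becomes its own field value.
    public static void addSeparately(Map<String, List<String>> doc,
                                     String field, List<String> names) {
        for (String name : names) {
            doc.computeIfAbsent(field, k -> new ArrayList<>()).add(name);
        }
    }

    public static void main(String[] args) {
        // Both subcollections match http://asdf/news/ in the test crawl.
        List<String> matches = List.of("asdf", "news");

        Map<String, List<String>> observed = newDoc();
        addConcatenated(observed, "subcollection", matches);

        Map<String, List<String>> expected = newDoc();
        addSeparately(expected, "subcollection", matches);

        System.out.println(observed); // {subcollection=[asdf news]}
        System.out.println(expected); // {subcollection=[asdf, news]}
    }
}
```

With a field of type "string", the single value "asdf news" is indexed as one untokenized term, so a query for subcollection:news would not match that document, while the two-value form would.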

My Solr schema.xml has the following declaration for the subcollection field:

<field name="subcollection" type="string" stored="true" indexed="true" multiValued="true" />

The latest nightly build I could find:
nutch-2010-07-07_04-49-04


> Make subcollection index filed multivalued
> ------------------------------------------
>
>                 Key: NUTCH-716
>                 URL: https://issues.apache.org/jira/browse/NUTCH-716
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Dmitry Lihachev
>             Fix For: 1.2, 2.0
>
>         Attachments: NUTCH-716-1_2.patch, NUTCH-716_multivalued_subcollection.patch
>
>
> Looks like a reasonable thing to do. Marking as 1.2 and will commit if no one objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

