nutch-dev mailing list archives

From "kaveh minooie (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field
Date Fri, 07 Nov 2014 19:07:36 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kaveh minooie updated NUTCH-1140:
---------------------------------
    Attachment: 0001-NUTCH-1140-trunk.patch
                0001-NUTCH-1140-2.x.patch

This is still an issue. Here is a sample list of URLs in the wild that trigger the
problem:

http://www.10-s.com/site/tennis-supply/site-map.html
http://www.bigappleherp.com/site/content/big_apple_cares.html
http://www.bigappleherp.com/site/content/CareSheets.html
http://www.bigappleherp.com/site/content/company_information.html
http://www.bigappleherp.com/site/content/customer_service.html
http://www.bigappleherp.com/site/content/LiveAnimals.html
http://www.bigappleherp.com/site/content/testimonials_02.html
http://www.magellangps.com/lp/truckfamily/screens.html

Now, based on a bit of reading that I did on Content-Disposition, it is a reasonable alternative
way of determining a title, which would mostly be just the file name. However, it should NOT
override the actual title if one exists, as the information in the title is far more valuable
than the file name. Not to mention that the title is the actual title and should not be replaced
just because some other value exists.
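
As a rough sketch of the fallback behavior described above (this is not the attached patch): the
file name from Content-Disposition is used as the title only when the document has no title yet.
SimpleDoc, the FILENAME pattern, and titleFromContentDisposition are hypothetical names made up
for illustration, and java.util.regex is used here rather than whatever matcher the plugin
itself uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleFallbackSketch {

    // Hypothetical stand-in for an index document: field name -> list of values.
    static final class SimpleDoc {
        private final Map<String, List<String>> fields = new HashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        List<String> values(String name) {
            return fields.getOrDefault(name, new ArrayList<>());
        }
    }

    // Illustrative pattern only (an assumption); real headers can be more varied.
    private static final Pattern FILENAME =
        Pattern.compile("\\bfilename=\"?([^\";]+)\"?", Pattern.CASE_INSENSITIVE);

    // Adds a title taken from Content-Disposition only if the document has none.
    static void titleFromContentDisposition(SimpleDoc doc, String contentDisposition) {
        if (contentDisposition == null) {
            return;
        }
        if (!doc.values("title").isEmpty()) {
            return; // keep the real title; never override or duplicate it
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (m.find()) {
            doc.add("title", m.group(1).trim());
        }
    }

    public static void main(String[] args) {
        SimpleDoc titled = new SimpleDoc();
        titled.add("title", "Site Map"); // made-up title for the example
        titleFromContentDisposition(titled, "inline; filename=\"site-map.html\"");
        System.out.println(titled.values("title")); // prints [Site Map]

        SimpleDoc untitled = new SimpleDoc();
        titleFromContentDisposition(untitled, "inline; filename=screens.html");
        System.out.println(untitled.values("title")); // prints [screens.html]
    }
}
```

Either way the title field ends up with at most one value, so a single-valued title field in the
schema is never handed two values.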

> index-more plugin, resetTitle method creates multiple values in the Title field
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-1140
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1140
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Joe Liedtke
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: 0001-NUTCH-1140-2.x.patch, 0001-NUTCH-1140-trunk.patch, MoreIndexingFilter.093011.patch
>
>
> From the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset
> the Title field of a document if it contains a Content-Disposition header. The current behavior
> is to add a Title regardless of whether one exists or not, which can cause issues down the
> line with the Solr Indexing process, and based on a thread in the nutch user list it appears
> that this is causing some users to mark the title as multi-valued in the schema:
>   http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
> The following patch removes the title field before adding a new one, which has resolved
> the issue for me:
> --- MoreIndexingFilter.old	2011-09-30 11:44:35.000000000 +0000
> +++ MoreIndexingFilter.java	2011-09-30 09:58:48.000000000 +0000
> @@ -276,6 +276,7 @@
>      for (int i=0; i<patterns.length; i++) {
>        if (matcher.contains(contentDisposition,patterns[i])) {
>          result = matcher.getMatch();
> +        doc.removeField("title");
>          doc.add("title", result.group(1));
>          break;
>        }
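
To make the effect of the quoted one-line change concrete, here is a minimal sketch that models
the title field as a plain list. It does not use the Nutch document class; appendTitle and
replaceTitle are hypothetical helpers, and the "Care Sheets" title is a made-up example value.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplaceTitleSketch {

    // Behavior before the change: the header-derived value is added next to
    // whatever title is already present.
    static List<String> appendTitle(List<String> titles, String fromHeader) {
        List<String> out = new ArrayList<>(titles);
        out.add(fromHeader);
        return out;
    }

    // Behavior with the quoted change: existing values are dropped first
    // (the analogue of removing the field), then the new value is added.
    static List<String> replaceTitle(List<String> titles, String fromHeader) {
        List<String> out = new ArrayList<>();
        out.add(fromHeader);
        return out;
    }

    public static void main(String[] args) {
        List<String> existing = List.of("Care Sheets");
        System.out.println(appendTitle(existing, "CareSheets.html"));  // prints [Care Sheets, CareSheets.html]
        System.out.println(replaceTitle(existing, "CareSheets.html")); // prints [CareSheets.html]
    }
}
```

The replace variant yields a single value, but that value is the file name rather than the page's
own title.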



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
