nutch-dev mailing list archives

From "Greg Padiasek (JIRA)" <>
Subject [jira] [Updated] (NUTCH-1749) Title duplicated in document body
Date Sun, 06 Apr 2014 04:07:14 GMT


Greg Padiasek updated NUTCH-1749:

    Attachment: DOMContentUtils.patch

> Title duplicated in document body
> ---------------------------------
>                 Key: NUTCH-1749
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Greg Padiasek
>         Attachments: DOMContentUtils.patch
>
> The HTML parser plugin inserts the document title into the document content. Since the
> title alone can be retrieved via DOMContentUtils.getTitle() and the content via
> DOMContentUtils.getText(), there is no need to duplicate the title in the content. When
> the title is included in the content, it becomes difficult or impossible to extract the
> document body without the title. The need for this arises when a user wants to index or
> display the body and the title separately.
>
> Attached is a patch that prevents the title from being included in the document content
> in the HTML parser.
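
For illustration only, the behaviour described above can be sketched with the JDK's
org.w3c.dom API. This is not the attached DOMContentUtils.patch, whose contents are not
shown in this message, and it does not use Nutch's classes. The class TitleFreeText, its
getText(Node, boolean) and parse(String) methods, and the skipTitle flag are hypothetical
names chosen for the example. The walk collects text nodes in document order and, when
asked, does not descend into the <title> element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TitleFreeText {

    // Hypothetical helper (an assumption, not the Nutch API): returns the
    // text under root, optionally leaving out everything inside <title>.
    static String getText(Node root, boolean skipTitle) {
        StringBuilder sb = new StringBuilder();
        collect(root, skipTitle, sb);
        return sb.toString();
    }

    private static void collect(Node node, boolean skipTitle, StringBuilder sb) {
        // Do not descend into <title> when the caller wants body text only.
        if (skipTitle
                && node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                // Separate adjacent text fragments with a single space.
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, skipTitle, sb);
        }
    }

    // Parses well-formed XHTML into a DOM tree. Real-world HTML would need a
    // tolerant HTML parser instead; that part is out of scope for this sketch.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><head><title>Hello</title></head><body><p>World</p></body></html>");
        System.out.println(getText(doc, false)); // title is part of the text
        System.out.println(getText(doc, true));  // title is left out
    }
}
```

With the title skipped, a caller can index or display the body on its own and still obtain
the title through a separate call, which is the separation the issue asks for.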

This message was sent by Atlassian JIRA