nutch-dev mailing list archives

From "Greg Padiasek (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1749) Title duplicated in document body
Date Sun, 06 Apr 2014 04:07:14 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Padiasek updated NUTCH-1749:
---------------------------------

    Attachment: DOMContentUtils.patch

> Title duplicated in document body
> ---------------------------------
>
>                 Key: NUTCH-1749
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1749
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Greg Padiasek
>         Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content. Since the
> title alone can be retrieved via DOMContentUtils.getTitle() and the content via
> DOMContentUtils.getText(), there is no need to duplicate the title in the content. When
> the title is included in the content, it becomes difficult or impossible to extract the
> document body without it. This matters when a user wants to index or display the body
> and title separately.
> Attached is a patch that prevents the HTML parser plugin from including the title in the
> document content.
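
The behaviour requested above can be illustrated with a minimal sketch. This is not the attached DOMContentUtils.patch and does not use Nutch code: it relies only on the JDK's org.w3c.dom API, and the class and method names (TitleFreeText, titleOf, bodyText, parse) are hypothetical, chosen for illustration. It assumes well-formed XHTML input, whereas a real HTML parser has to cope with malformed markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TitleFreeText {

    // Hypothetical helper: true if the node is a <title> element.
    private static boolean isTitle(Node node) {
        return node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName());
    }

    // Hypothetical helper: text of the first non-empty <title>, or "" if none.
    public static String titleOf(Node node) {
        if (isTitle(node)) {
            return node.getTextContent().trim();
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            String title = titleOf(c);
            if (!title.isEmpty()) {
                return title;
            }
        }
        return "";
    }

    // Hypothetical helper: appends all text nodes to sb, separated by single
    // spaces, but skips the whole <title> subtree so the title is not
    // duplicated in the extracted content.
    public static void bodyText(Node node, StringBuilder sb) {
        if (isTitle(node)) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            bodyText(c, sb);
        }
    }

    // Parses a well-formed XHTML string with the JDK's DOM parser.
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>My Title</title></head>"
                + "<body><p>Body text.</p></body></html>");
        StringBuilder sb = new StringBuilder();
        bodyText(doc, sb);
        System.out.println("title: " + titleOf(doc));
        System.out.println("text:  " + sb);
    }
}
```

With the title skipped during text collection, a caller can index or display the two fields independently, which is the use case described in the report.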



--
This message was sent by Atlassian JIRA
(v6.2#6252)
