nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed
Date Tue, 21 Nov 2017 22:02:00 GMT


Sebastian Nagel commented on NUTCH-2464:

Hi [~cpallansch], could you provide an example of the failure and explain which plugins are
affected? I wasn't able to reproduce the problem with the current 2.x branch (tested with both
parse-html and parse-tika): the extracted text for the HTML snippet
<h1>header with <span>span element</span></h1>
is "header with span element", which is the expected result. A test document is attached; you
can test your configuration by running
bin/nutch parsechecker -dumpText http://.../NUTCH-2464-complex-header.html
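
For illustration only, the expected behaviour described above (text nodes inside nested
elements contribute to the header text) can be sketched in a few lines of Python using the
standard-library html.parser module. This is not how parse-html or parse-tika is implemented;
the names HeaderTextCollector and extract_headers are hypothetical and exist only for this
sketch.

```python
from html.parser import HTMLParser

# Header tags whose text content we want to collect.
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderTextCollector(HTMLParser):
    """Hypothetical helper: gathers the text of each header element,
    including text nodes nested inside child elements such as <span> or <a>."""

    def __init__(self):
        super().__init__()
        self.headers = []         # finished header strings, in document order
        self._open_header = None  # tag name of the header currently open, if any
        self._parts = []          # text fragments seen inside the open header

    def handle_starttag(self, tag, attrs):
        # Start collecting when an outermost header tag opens.
        if self._open_header is None and tag in HEADER_TAGS:
            self._open_header = tag
            self._parts = []

    def handle_data(self, data):
        # Any text node below the open header counts, regardless of nesting depth.
        if self._open_header is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # Close the header and normalise runs of whitespace to single spaces.
        if tag == self._open_header:
            self.headers.append(" ".join("".join(self._parts).split()))
            self._open_header = None


def extract_headers(html):
    """Return the text of every header element found in the given HTML string."""
    collector = HeaderTextCollector()
    collector.feed(html)
    collector.close()
    return collector.headers
```

Calling extract_headers on the snippet above yields a single entry whose text includes the
words inside the span, which matches the parsechecker result reported in this comment.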

> Headers That Contain HTML Elements Are Not Parsed
> -------------------------------------------------
>                 Key: NUTCH-2464
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin
>    Affects Versions: 2.3
>         Environment: Internal development/test environments.
>            Reporter: Cass Pallansch
> Nutch does not appear to traverse the HTML elements that may be contained within header
> elements (e.g., H1, H2, H3, etc. tags). Many times there are anchors and/or <span>
> tags within these elements that contain the actual text nodes that should be picked up as
> the header value for indexing purposes.

This message was sent by Atlassian JIRA