nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed
Date Thu, 30 Nov 2017 19:20:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273207#comment-16273207 ]

ASF GitHub Bot commented on NUTCH-2464:
---------------------------------------

sebastian-nagel commented on issue #244: Fix for NUTCH-2464 get textual content from nested heading nodes
URL: https://github.com/apache/nutch/pull/244#issuecomment-348292136
 
 
   Looks good to me. Thanks, @jorgelbg!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Headers That Contain HTML Elements Are Not Parsed
> -------------------------------------------------
>
>                 Key: NUTCH-2464
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2464
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin
>    Affects Versions: 1.13
>         Environment: Internal development/test environments.
>            Reporter: Cass Pallansch
>         Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained within
> heading elements (e.g., H1, H2, H3 tags). These elements often contain anchors
> and/or <span> tags that hold the actual text nodes that should be picked up as
> the header value for indexing purposes.
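
The behaviour requested in the issue can be illustrated with a small standalone sketch. This is not the code from pull request #244 and not the Nutch plugin API; the class `HeadingText` and its methods are hypothetical names chosen for illustration, and the sample page is assumed to be well-formed XHTML so that the JDK's XML parser can read it. The idea is to walk every descendant of a heading node and concatenate the text nodes found, rather than reading only the heading's direct child.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only (not part of Nutch).
public class HeadingText {

    // Recursively append the value of every text node below `node`.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Text of a heading node, including text nested in <a>, <span>, etc.,
    // with runs of whitespace collapsed to a single space.
    static String headingText(Node heading) {
        StringBuilder sb = new StringBuilder();
        collectText(heading, sb);
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    // Parse well-formed XHTML and return the text of the first `tag`
    // element, or null if the document has no such element.
    static String firstHeading(String xhtml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = doc.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? null : headingText(nodes.item(0));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1><a href=\"#top\">"
            + "<span>Nested</span> heading</a></h1></body></html>";
        System.out.println(firstHeading(page, "h1")); // prints "Nested heading"
    }
}
```

For a plain DOM node, `Node.getTextContent()` returns a similar concatenation; the explicit recursion is shown here only to make the traversal visible.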



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
