nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2611) Add line-breaks when parsing HTML block-level elements
Date Thu, 28 Jun 2018 09:09:00 GMT


ASF GitHub Bot commented on NUTCH-2611:

sebastian-nagel edited a comment on issue #354: NUTCH-2611: Add line-breaks when parsing HTML block-level elements
   +1 lgtm. The plain-text layout is now indeed more readable: there are line breaks after headings,
`<p>`, etc. Will commit soon if there are no objections. Thanks, @YossiTamari!

This is an automated message from the Apache Git Service.
> Add line-breaks when parsing HTML block-level elements
> ------------------------------------------------------
>                 Key: NUTCH-2611
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Yossi Tamari
>            Priority: Major
>             Fix For: 1.15
> Currently, the HTML and Tika parsers only add newlines following text nodes that contain
> only whitespace (e.g. `</span> <span>`), but not based on what the tags are, so,
> for example, `</div><div>` will not add a newline.
> While some applications do not differentiate between a space and a newline, many others
> see the semantic difference (two adjacent words in the same sentence are "near", but words
> in separate sentences are not).
> I believe adding newlines after block-level HTML elements, while not a panacea, will
> be an improvement on the current behavior.
> NUTCH-2318 is related to this.
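
For illustration only, the behavior proposed in the description can be sketched in a few lines of Python. This is not the Nutch or Tika code (both are Java), and the names `BLOCK_TAGS`, `BlockAwareTextExtractor` and `extract_text` are hypothetical. The tag set is an assumed, partial list of block-level elements. The sketch emits a newline at the start and end of each block-level element and leaves inline elements such as `<span>` alone.

```python
from html.parser import HTMLParser

# Assumption: a partial, illustrative set of block-level tags,
# not the list used by the actual Nutch patch.
BLOCK_TAGS = {
    "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "table", "tr", "blockquote", "pre",
}


class BlockAwareTextExtractor(HTMLParser):
    """Collects text and adds a newline around block-level elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Return plain text with one line per block-level element."""
    parser = BlockAwareTextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in raw.split("\n")]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(extract_text("<div>first sentence</div><div>second sentence</div>"))
    print(extract_text("<span>near</span> <span>words</span>"))
```

With this sketch, text from adjacent `<div>` elements ends up on separate lines, while text from adjacent `<span>` elements stays on one line separated by a space, which matches the "near" distinction drawn in the description.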

This message was sent by Atlassian JIRA
