lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5763) HTMLStripCharFilter += HTML5
Date Thu, 19 Jun 2014 12:15:25 GMT


Steve Rowe commented on LUCENE-5763:

bq. Would it be useful at all to have a config option for the HTML version?

I don't think so. This filter is generally used on HTML you don't control (hence its ability
to handle non-well-formed content), so it seems very unlikely that people will know which
HTML version they should target.  I also don't think we should have a mode that outputs
the HTML4 versions of the angle brackets (left: U+2329; right: U+232A), because the Unicode
specification describes these characters as deprecated. From []:

*Deprecated angle brackets*

These characters are deprecated and are strongly discouraged for mathematical use because
of their canonical equivalence to CJK punctuation.
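
To make the canonical-equivalence point concrete, here is a minimal sketch using only the JDK's java.text.Normalizer. The class and method names are made up for illustration. The U+27E8/U+27E9 pair is, as far as I know, what the HTML5 entity table assigns to &lang;/&rang;, so treat that mapping as an assumption to check against the spec.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Lucene.
public class AngleBracketEquivalence {

    /** Returns the NFC-normalized form of the input. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Formats each code point of the input as U+XXXX, space-separated. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // HTML4 angle brackets (U+2329, U+232A): canonically equivalent to CJK punctuation.
        String html4 = "\u2329\u232A";
        // Assumed HTML5 replacements (U+27E8, U+27E9): no canonical decomposition.
        String html5 = "\u27E8\u27E9";

        System.out.println(codePoints(nfc(html4))); // U+3008 U+3009
        System.out.println(codePoints(nfc(html5))); // U+27E8 U+27E9
    }
}
```

So any downstream step that applies Unicode normalization would silently turn the HTML4 characters into CJK angle brackets, while the mathematical ones survive unchanged.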


> HTMLStripCharFilter += HTML5 
> -----------------------------
>                 Key: LUCENE-5763
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: modules/analysis
>            Reporter: Steve Rowe
>            Priority: Minor
> HTMLStripCharFilter knows some specific things about HTML4 (like named character entities,
> which are converted to the corresponding characters), but not about HTML5.
> HTML5 has way more named character entities: 2,231 vs 259 by my count.
> There's probably other stuff to do, e.g. there are new tags.
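
For readers unfamiliar with what "named character entities are converted to the corresponding characters" means in practice, here is a rough sketch of the idea. It is not HTMLStripCharFilter's implementation: the class name, the regex, and the tiny six-entry table are all hypothetical stand-ins for the full HTML4 or HTML5 entity list.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the real filter uses its own scanner and a full entity table.
public class EntitySketch {

    // A handful of named entities; lang/rang use the assumed HTML5 code points.
    static final Map<String, String> ENTITIES = Map.of(
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "nbsp", "\u00A0",
            "lang", "\u27E8",
            "rang", "\u27E9");

    static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known named entities with their characters; unknown ones are left as-is. */
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            String text = (replacement != null) ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("a &lt; b &amp; c &bogus;")); // a < b & c &bogus;
    }
}
```

Supporting HTML5 in a scheme like this mostly means growing the table from a few hundred entries to a couple of thousand and deciding what to emit where the two versions disagree.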
