tika-dev mailing list archives

From "Nicholas DiPiazza (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2805) Should the HTML parser by default just ignore the <noscript> section?
Date Sun, 06 Jan 2019 00:05:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas DiPiazza updated TIKA-2805:
------------------------------------
    Description: 
Tika's HTML parser will take this:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a secured
browser on the server. Please enable scripts and reload this page.</div></noscript>{code}
and will parse it into:
{code:java}
You may be trying to access this site from a secured browser on the server. Please enable
scripts and reload this page.{code}
Shouldn't it just ignore those sections and leave them out of the parse output?

  was:
The Tika parser will take this:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a secured
browser on the server. Please enable scripts and reload this page.</div></noscript>{code}
and will parse it into:
{code:java}
You may be trying to access this site from a secured browser on the server. Please enable
scripts and reload this page.{code}
Shouldn't it just ignore those sections and leave them out of the parse output?


> Should the HTML parser by default just ignore the <noscript> section?
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2805
>                 URL: https://issues.apache.org/jira/browse/TIKA-2805
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> Tika's HTML parser will take this:
> {code:java}
> <noscript><div class='noindex'>You may be trying to access this site from a secured
> browser on the server. Please enable scripts and reload this page.</div></noscript>{code}
> and will parse it into:
> {code:java}
> You may be trying to access this site from a secured browser on the server. Please enable
> scripts and reload this page.{code}
> Shouldn't it just ignore those sections and leave them out of the parse output?
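
To make the requested behavior concrete, here is a minimal sketch of a SAX handler that collects text but drops everything nested inside <noscript>. It is not Tika code and not a proposed patch: the class name NoscriptSkippingHandler and the helper extractText are hypothetical, and the demo feeds well-formed XHTML through the JDK's own SAX parser rather than through Tika. Tika does deliver parse output to a SAX ContentHandler, but whether its HTML parser forwards <noscript> element events to that handler depends on its HTML mapping configuration, which this issue does not describe, so treat that part as an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: a text-collecting SAX handler that
// ignores character data found inside <noscript> elements.
final class NoscriptSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Depth counter so that nested <noscript> elements are handled correctly.
    private int noscriptDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("noscript".equalsIgnoreCase(qName)) {
            noscriptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("noscript".equalsIgnoreCase(qName) && noscriptDepth > 0) {
            noscriptDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when we are outside every <noscript> element.
        if (noscriptDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString();
    }

    // Demo helper (assumption: input is well-formed XHTML, not arbitrary HTML).
    static String extractText(String xhtml) throws Exception {
        NoscriptSkippingHandler handler = new NoscriptSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Hello</p>"
                + "<noscript><div class='noindex'>Please enable scripts.</div></noscript>"
                + "<p>World</p></body></html>";
        System.out.println(extractText(page)); // prints "HelloWorld"
    }
}
```

The depth counter, rather than a boolean flag, keeps the filter correct if <noscript> elements are ever nested. Making this the default would still be a behavior change for users who rely on the <noscript> text today, which is the question this issue raises.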



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
