tika-dev mailing list archives

From "Nicholas DiPiazza (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2805) Should the HTML parser by default just ignore the <noscript> section?
Date Sun, 06 Jan 2019 00:04:00 GMT
Nicholas DiPiazza created TIKA-2805:
---------------------------------------

             Summary: Should the HTML parser by default just ignore the <noscript> section?
                 Key: TIKA-2805
                 URL: https://issues.apache.org/jira/browse/TIKA-2805
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Nicholas DiPiazza


The Tika HTML parser takes this input:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a secured
browser on the server. Please enable scripts and reload this page.</div></noscript>{code}
and includes its text in the parse output:
{code:java}
You may be trying to access this site from a secured browser on the server. Please enable
scripts and reload this page.{code}
Shouldn't the parser ignore <noscript> sections by default and leave their text out of the parse output?
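
To illustrate the requested behavior, here is a minimal sketch of a SAX handler that drops any text nested inside <noscript>. It uses only JDK SAX classes, not Tika's API. The class name NoscriptSkippingHandler and the helper extractText are hypothetical, and the sketch assumes the handler actually receives start and end events for noscript elements and that the input is well-formed XHTML. With Tika, whether those events reach a downstream handler may depend on the HTML mapper in use.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: this is NOT part of Tika.
public class NoscriptSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Greater than zero while the parser is inside one or more <noscript> elements.
    private int noscriptDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("noscript".equalsIgnoreCase(qName)) {
            noscriptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("noscript".equalsIgnoreCase(qName) && noscriptDepth > 0) {
            noscriptDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when it is outside every <noscript> element.
        if (noscriptDepth == 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString();
    }

    // Assumed helper: parses well-formed XHTML with the JDK SAX parser
    // and returns the text that is not inside <noscript>.
    public static String extractText(String xhtml) {
        NoscriptSkippingHandler handler = new NoscriptSkippingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return handler.getText();
    }
}
```

Calling NoscriptSkippingHandler.extractText on a document that contains the <noscript> snippet above would return the surrounding text without the "Please enable scripts" message. A depth counter is used rather than a boolean so that nested <noscript> elements are handled consistently.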



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
