Jayesh Shende created TIKA-2382:
-----------------------------------
Summary: Remove innerText of <Script> and <Style> if present inside
<Body> after parsing HTML
Key: TIKA-2382
URL: https://issues.apache.org/jira/browse/TIKA-2382
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.13
Environment: JDK 1.8 usage with Solr 6.5.1
Reporter: Jayesh Shende
Priority: Minor
Fix For: 1.16
If a fetched HTML page contains <script> and <style> tags inside the <body> tag
(not in the <head> tag), then after parsing, the inner text (i.e. ECMAScript/JS code and CSS
styles) of <script> and <style> remains part of the parsed text.
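
Until the parser handles this itself, one possible workaround is to strip these elements from the HTML before it is handed to the parser. The following is only a minimal sketch using the JDK regex classes; the class name ScriptStyleStripper and its strip method are hypothetical and are not part of Tika or of any fix for this issue.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: removes <script> and <style>
// elements (tags plus inner text) from an HTML string before parsing.
final class ScriptStyleStripper {

    // (?is): case-insensitive, and '.' also matches line breaks.
    // \b stops tags such as <styles> from matching; \1 requires the
    // closing tag to name the same element as the opening tag.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    static String strip(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p>"
                + "<Script type=\"text/javascript\">var x = 1;</Script>"
                + "<style>p { color: red; }</style>"
                + "<p>World</p></body></html>";
        // prints <html><body><p>Hello</p><p>World</p></body></html>
        System.out.println(strip(html));
    }
}
```

A regex is a blunt tool for HTML: a script whose source contains the literal text "</script>" inside a string, for example, would be cut short. It is meant only to illustrate the expected result, where the inner text of both elements is absent from the parsed output.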
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)