tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2759) ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler
Date Mon, 22 Oct 2018 17:36:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659204#comment-16659204 ]

Hudson commented on TIKA-2759:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #337 (See [https://builds.apache.org/job/tika-2.x-windows/337/])
TIKA-2759 -- don't extract data uri if inside a <script/> element when (tallison: rev 17cc77486d1c3f3e3379966b947687c57656a061)
* (add) tika-parsers/src/test/resources/test-documents/testHTML_embedded_data_uri_js.html
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
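
The commit message above describes skipping data-URI extraction while inside a script element. As a rough illustration of that kind of guard, and not the actual change to HtmlHandler.java, here is a minimal sketch of a plain SAX handler that tracks whether a script element is open and drops any characters() reported in the meantime. The class name ScriptSkippingHandler and the extract() helper are hypothetical, and the sketch uses only the JDK's SAX parser on well-formed XHTML rather than Tika's HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not Tika's HtmlHandler): collects text content but
// ignores anything reported while a <script> element is open.
public class ScriptSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int scriptDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("script".equalsIgnoreCase(qName)) {
            scriptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("script".equalsIgnoreCase(qName) && scriptDepth > 0) {
            scriptDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is not nested inside a script element.
        if (scriptDepth == 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString();
    }

    // Assumption: the input is well-formed XHTML, since this uses the
    // JDK's (non-namespace-aware) XML SAX parser, not an HTML parser.
    public static String extract(String xhtml) throws Exception {
        ScriptSkippingHandler handler = new ScriptSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getText();
    }
}
```

Note that the report below says the script start tag was never passed to startElement(), so a guard like this in a downstream ContentHandler would not have caught the reported case; that may be why the change was made in the parser's own handler.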


> ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-2759
>                 URL: https://issues.apache.org/jira/browse/TIKA-2759
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0, 1.20
>
>         Attachments: petrolicious.html
>
>
> We extract Javascript as text content while instead it is actually a script tag with
> base64 inline. This inline code is decoded and reported in the characters() method of our
> custom ContentHandler, and ends up as text being extracted, but it seems the Javascript start
> tag itself is never reported to startElement(). The Javascript is reported to characters()
> after we left the head and entered the body.
> HTML file is attached
> The following script tag:
> {code}
>   <script src="data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSkoZnVuY3Rpb24oJCl7bmV3IEltcHJvdmVkQUpBWExvZ2luKHsNCmlkOiAxNTcsDQppc0d1ZXN0OiAxLA0Kb2F1dGg6IHsiZmFjZWJvb2siOiJodHRwczpcL1wvd3d3LmZhY2Vib29rLmNvbVwvZGlhbG9nXC9vYXV0aD9zY29wZT1lbWFpbCZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9MTcyODk0MjQzMDY1MDQ4NiZyZWRpcmVjdF91cmk9aHR0cCUzQSUyRiUyRnBldHJvbGljaW91cy5jb20lMkZpbmRleC5waHAlM0ZvcHRpb24lM0Rjb21faW1wcm92ZWRfYWpheF9sb2dpbiUyNnRhc2slM0RmYWNlYm9vayIsImdvb2dsZSI6Imh0dHBzOlwvXC9hY2NvdW50cy5nb29nbGUuY29tXC9vXC9vYXV0aDJcL2F1dGg/c2NvcGU9aHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8uZW1haWwraHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8ucHJvZmlsZSZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9ODQ5NDk3NjQ3ODUzLW1mOThqNGdlOGkwYzlkaTFrbG9zc2YxbmdibWI2cG12LmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tJnJlZGlyZWN0X3VyaT1odHRwJTNBJTJGJTJGcGV0cm9saWNpb3VzLmNvbSUyRmluZGV4LnBocCUzRm9wdGlvbiUzRGNvbV9pbXByb3ZlZF9hamF4X2xvZ2luJTI2dGFzayUzRGdvb2dsZSJ9LA0KYmdPcGFjaXR5OiAwLjQsDQpyZXR1cm5Vcmw6ICcvaXMtdGhpcy1kdXRjaC1jbGFzc2ljLWZpbmFsbHktYXMtY29vbC1hcy1hLWJtdycsDQpib3JkZXI6IHBhcnNlSW50KCdmNWY1ZjV8KnwzfCp8YzRjNGM0fCp8Nycuc3BsaXQoJ3wqfCcpWzFdKSwNCnBhZGRpbmc6IDQsDQp1c2VBSkFYOiAwLA0Kb3BlbkV2ZW50OiAnb25jbGljaycsDQp3bmRDZW50ZXI6IDAsDQpyZWdQb3B1cDogMSwNCmR1cjogMzAwLA0KdGltZW91dDogMCwNCmJhc2U6ICcvJywNCnRoZW1lOiAncGV0cm9saWNpb3VzJywNCnNvY2lhbFByb2ZpbGU6ICcnLA0Kc29jaWFsVHlwZTogJ2J0bkljbycsDQpjc3NQYXRoOiAnL21vZHVsZXMvbW9kX2ltcHJvdmVkX2FqYXhfbG9naW4vY2FjaGUvMTU3LzNkNDE4Mzk2NDk2N2Y2ZWVlYjI5MTdhOTI2OGM2MTIxLmNzcycsDQpyZWdQYWdlOiAnam9vbWxhJywNCmNhcHRjaGE6ICcnLA0Kc2hvd0hpbnQ6IDAsDQpnZW9sb2NhdGlvbjogZmFsc2UsDQp3aW5kb3dBbmltOiAnJw0KfSl9KTs="
>   type="text/javascript"></script>
> {code}
> gets reported outside the head (in html.p) as:
> {code}
> ;(window.jq183||jQuery)(function($){new ImprovedAJAXLogin({
> id: 157,
> isGuest: 1,
> oauth: {"facebook":"https:\/\/www.facebook.com\/dialog\/oauth?scope=email&response_type=code&display=popup&client_id=1728942430650486&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dfacebook","google":"https:\/\/accounts.google.com\/o\/oauth2\/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile&response_type=code&display=popup&client_id=849497647853-mf98j4ge8i0c9di1klossf1ngbmb6pmv.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dgoogle"},
> bgOpacity: 0.4,
> returnUrl: '/is-this-dutch-classic-finally-as-cool-as-a-bmw',
> border: parseInt('f5f5f5|*|3|*|c4c4c4|*|7'.split('|*|')[1]),
> padding: 4,
> useAJAX: 0,
> openEvent: 'onclick',
> wndCenter: 0,
> regPopup: 1,
> dur: 300,
> timeout: 0,
> base: '/',
> theme: 'petrolicious',
> socialProfile: '',
> socialType: 'btnIco',
> cssPath: '/modules/mod_improved_ajax_login/cache/157/3d4183964967f6eeeb2917a9268c6121.css',
> regPage: 'joomla',
> captcha: '',
> showHint: 0,
> geolocation: false,
> windowAnim: ''
> })});
> {code}
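
The link between the script tag's src attribute and the text reported above can be checked by decoding the data URI by hand. The following is a minimal sketch using java.util.Base64 from the JDK; the class name DataUriDecoder is hypothetical, the URI in main() is only the first 32 base64 characters of the one in the report (truncated for readability), and UTF-8 is an assumed charset since the data URI does not declare one.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of Tika.
public class DataUriDecoder {
    private static final String MARKER = ";base64,";

    // Decodes the payload of a base64 data URI to a string.
    // Assumption: the payload is UTF-8 text.
    public static String decodeBase64DataUri(String dataUri) {
        if (!dataUri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data URI");
        }
        int idx = dataUri.indexOf(MARKER);
        if (idx < 0) {
            throw new IllegalArgumentException("not a base64 data URI");
        }
        byte[] bytes = Base64.getDecoder().decode(dataUri.substring(idx + MARKER.length()));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First 32 base64 characters of the src attribute quoted in the report.
        String uri = "data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSko";
        System.out.println(decodeBase64DataUri(uri)); // prints ;(window.jq183||jQuery)(
    }
}
```

The decoded prefix matches the start of the text that reached characters(), which is consistent with the script's data URI being decoded and emitted as content.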



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
