tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2759) ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler
Date Mon, 22 Oct 2018 18:09:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659252#comment-16659252 ]

Hudson commented on TIKA-2759:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #119 (See [https://builds.apache.org/job/tika-branch-1x/119/])
TIKA-2759 -- don't extract data uri if inside a <script/> element when (tallison: [https://github.com/apache/tika/commit/7a34b5866a97a95367ffd9fdc7210743bc17c754])
* (add) tika-parsers/src/test/resources/test-documents/testHTML_embedded_data_uri_js.html
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
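
The commit summary above suggests that data URIs are no longer extracted when they occur on a script element. The following is only a rough sketch of that idea using the plain JDK SAX API. It is not the actual change to HtmlHandler, and the class name DataUriCollector and its collect() method are hypothetical. It assumes well-formed XHTML input and only looks at the src attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is not Tika's HtmlHandler.
class DataUriCollector extends DefaultHandler {
    final List<String> dataUris = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumed rule taken from the commit summary: ignore data URIs on <script>.
        if ("script".equalsIgnoreCase(qName)) {
            return;
        }
        String src = atts.getValue("src");
        if (src != null && src.startsWith("data:")) {
            dataUris.add(src);
        }
    }

    // Parses a well-formed XHTML string and returns the data URIs that were kept.
    static List<String> collect(String xhtml) {
        DataUriCollector handler = new DataUriCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.dataUris;
    }
}
```

With this sketch, a data URI on an img element is collected, while the same kind of URI on a script element is skipped.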


> ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-2759
>                 URL: https://issues.apache.org/jira/browse/TIKA-2759
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0, 1.20
>
>         Attachments: petrolicious.html
>
>
> We are getting Javascript extracted as text content, although it is actually a script tag
with an inline base64 data URI. The inline code is decoded and reported to the characters()
method of our custom ContentHandler, so it ends up in the extracted text, yet the script start
tag itself never seems to be reported to startElement(). The Javascript is reported to
characters() after the parser has left the head and entered the body.
> The HTML file is attached.
> The following script tag:
> {code}
>   <script src="data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSkoZnVuY3Rpb24oJCl7bmV3IEltcHJvdmVkQUpBWExvZ2luKHsNCmlkOiAxNTcsDQppc0d1ZXN0OiAxLA0Kb2F1dGg6IHsiZmFjZWJvb2siOiJodHRwczpcL1wvd3d3LmZhY2Vib29rLmNvbVwvZGlhbG9nXC9vYXV0aD9zY29wZT1lbWFpbCZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9MTcyODk0MjQzMDY1MDQ4NiZyZWRpcmVjdF91cmk9aHR0cCUzQSUyRiUyRnBldHJvbGljaW91cy5jb20lMkZpbmRleC5waHAlM0ZvcHRpb24lM0Rjb21faW1wcm92ZWRfYWpheF9sb2dpbiUyNnRhc2slM0RmYWNlYm9vayIsImdvb2dsZSI6Imh0dHBzOlwvXC9hY2NvdW50cy5nb29nbGUuY29tXC9vXC9vYXV0aDJcL2F1dGg/c2NvcGU9aHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8uZW1haWwraHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8ucHJvZmlsZSZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9ODQ5NDk3NjQ3ODUzLW1mOThqNGdlOGkwYzlkaTFrbG9zc2YxbmdibWI2cG12LmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tJnJlZGlyZWN0X3VyaT1odHRwJTNBJTJGJTJGcGV0cm9saWNpb3VzLmNvbSUyRmluZGV4LnBocCUzRm9wdGlvbiUzRGNvbV9pbXByb3ZlZF9hamF4X2xvZ2luJTI2dGFzayUzRGdvb2dsZSJ9LA0KYmdPcGFjaXR5OiAwLjQsDQpyZXR1cm5Vcmw6ICcvaXMtdGhpcy1kdXRjaC1jbGFzc2ljLWZpbmFsbHktYXMtY29vbC1hcy1hLWJtdycsDQpib3JkZXI6IHBhcnNlSW50KCdmNWY1ZjV8KnwzfCp8YzRjNGM0fCp8Nycuc3BsaXQoJ3wqfCcpWzFdKSwNCnBhZGRpbmc6IDQsDQp1c2VBSkFYOiAwLA0Kb3BlbkV2ZW50OiAnb25jbGljaycsDQp3bmRDZW50ZXI6IDAsDQpyZWdQb3B1cDogMSwNCmR1cjogMzAwLA0KdGltZW91dDogMCwNCmJhc2U6ICcvJywNCnRoZW1lOiAncGV0cm9saWNpb3VzJywNCnNvY2lhbFByb2ZpbGU6ICcnLA0Kc29jaWFsVHlwZTogJ2J0bkljbycsDQpjc3NQYXRoOiAnL21vZHVsZXMvbW9kX2ltcHJvdmVkX2FqYXhfbG9naW4vY2FjaGUvMTU3LzNkNDE4Mzk2NDk2N2Y2ZWVlYjI5MTdhOTI2OGM2MTIxLmNzcycsDQpyZWdQYWdlOiAnam9vbWxhJywNCmNhcHRjaGE6ICcnLA0Kc2hvd0hpbnQ6IDAsDQpnZW9sb2NhdGlvbjogZmFsc2UsDQp3aW5kb3dBbmltOiAnJw0KfSl9KTs="
type="text/javascript"></script>
> {code}
> gets reported outside the head (in html.p) as:
> {code}
> ;(window.jq183||jQuery)(function($){new ImprovedAJAXLogin({
> id: 157,
> isGuest: 1,
> oauth: {"facebook":"https:\/\/www.facebook.com\/dialog\/oauth?scope=email&response_type=code&display=popup&client_id=1728942430650486&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dfacebook","google":"https:\/\/accounts.google.com\/o\/oauth2\/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile&response_type=code&display=popup&client_id=849497647853-mf98j4ge8i0c9di1klossf1ngbmb6pmv.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dgoogle"},
> bgOpacity: 0.4,
> returnUrl: '/is-this-dutch-classic-finally-as-cool-as-a-bmw',
> border: parseInt('f5f5f5|*|3|*|c4c4c4|*|7'.split('|*|')[1]),
> padding: 4,
> useAJAX: 0,
> openEvent: 'onclick',
> wndCenter: 0,
> regPopup: 1,
> dur: 300,
> timeout: 0,
> base: '/',
> theme: 'petrolicious',
> socialProfile: '',
> socialType: 'btnIco',
> cssPath: '/modules/mod_improved_ajax_login/cache/157/3d4183964967f6eeeb2917a9268c6121.css',
> regPage: 'joomla',
> captcha: '',
> showHint: 0,
> geolocation: false,
> windowAnim: ''
> })});
> {code}
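
To make the relationship between the two quoted blocks concrete, here is a minimal sketch of decoding a base64 data URI with java.util.Base64. The class DataUriSketch and the method decodeBase64DataUri are hypothetical names and are not part of Tika. The sketch assumes the payload is UTF-8 text and that the URI uses the ";base64" form.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of Tika.
class DataUriSketch {
    static String decodeBase64DataUri(String uri) {
        int comma = uri.indexOf(',');
        if (!uri.startsWith("data:") || comma < 0) {
            throw new IllegalArgumentException("not a data URI");
        }
        // Everything between "data:" and the first comma is the media type and flags.
        String header = uri.substring("data:".length(), comma);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("payload is not base64-encoded");
        }
        byte[] bytes = Base64.getDecoder().decode(uri.substring(comma + 1));
        // Assumption: the decoded script is UTF-8 text.
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Applied to the first 32 characters of the payload from the report, decodeBase64DataUri("data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSko") returns ;(window.jq183||jQuery)( which matches the start of the text reported to characters().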



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
