tika-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2328) HtmlParser fails when DOCTYPE has unbalanced quotes
Date Tue, 18 Apr 2017 14:45:41 GMT
Shai Erera created TIKA-2328:

             Summary: HtmlParser fails when DOCTYPE has unbalanced quotes
                 Key: TIKA-2328
                 URL: https://issues.apache.org/jira/browse/TIKA-2328
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Shai Erera

When attempting to parse HTML documents that start like this:

        <title>PolClub - Polish Page on VicNet - Australia</title>

I receive the following exception:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.String.substring(String.java:1967)
	at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
	at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)

The problem seems to be in TagSoup's {{Parser.trimquotes}}:

	private static String trimquotes(String in) {
		if (in == null) return in;
		int length = in.length();
		if (length == 0) return in;
		char s = in.charAt(0);
		char e = in.charAt(length - 1);
		if (s == e && (s == '\'' || s == '"')) {
			in = in.substring(1, in.length() - 1);
		}
		return in;
	}

Instead of checking only for a string of length 0, it should check {{if (length <= 1) return in;}},
since a string of length 1 cannot contain a matching pair of quotes to trim. Or, if the desired
behavior is to remove a leading quote only, the code should guard against this case explicitly.
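
A minimal sketch of the suggested guard, assuming the length check is the only change needed. The class name `TrimQuotesSketch` is hypothetical and is not part of TagSoup. For a one-character input such as a lone `"`, the first and last characters are the same quote, so the quoted code ends up calling `substring(1, 0)`, which throws. Returning early when `length <= 1` avoids that call.

```java
// Hypothetical standalone sketch; not the actual TagSoup source.
final class TrimQuotesSketch {

    // Strips one pair of matching surrounding quotes, if present.
    static String trimquotes(String in) {
        if (in == null) return in;
        int length = in.length();
        // Guard suggested in the report: a string of length 0 or 1
        // cannot hold a balanced pair of quotes, so return it unchanged.
        if (length <= 1) return in;
        char s = in.charAt(0);
        char e = in.charAt(length - 1);
        if (s == e && (s == '\'' || s == '"')) {
            in = in.substring(1, length - 1);
        }
        return in;
    }
}
```

With this guard, a lone quote character is returned unchanged, while a properly quoted value such as `"abc"` still has its surrounding quotes removed.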

I know the bug is in TagSoup, but it looks like the code hasn't been touched in 6 years. I
hope it's OK to report the bug here.
