tika-dev mailing list archives

From "Daniel Bonniot de Ruisselet (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-539) Encoding detection is too biased by encoding in meta tag
Date Wed, 22 Feb 2012 11:04:48 GMT

    [ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13213524#comment-13213524 ]

Daniel Bonniot de Ruisselet commented on TIKA-539:
--------------------------------------------------

I agree with Ken's comment from 05/Nov/10 21:35: detection should only be used when the declared
information is incorrect (this saves time and avoids wrong detection).

"A similar approach (excluding meta tags) should be used by the TXTParser." Agreed. I'm just
seeing a case where "Indanyl" is recognized as IBM500, even when "UTF-8" is specified. Should
a separate task be opened for that?
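
To make the "declared first, detect only on mismatch" idea concrete, here is a minimal sketch in plain java.nio. It is not Tika code: the class DeclaredFirstCharset, its methods, and the detector function are hypothetical names used only for illustration, and the detector is a stand-in for whatever statistical detection is actually in use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical helper, not part of Tika: trust the declared charset
// unless the bytes cannot be decoded with it.
final class DeclaredFirstCharset {

    // True if the whole byte array is valid in the given charset.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the declared charset when it is present and consistent with
    // the bytes; otherwise defers to the supplied detector (an assumption,
    // standing in for statistical detection).
    static Charset choose(byte[] data, Charset declared,
                          Function<byte[], Charset> detector) {
        if (declared != null && decodesCleanly(data, declared)) {
            return declared;
        }
        return detector.apply(data);
    }

    public static void main(String[] args) {
        // Stand-in detector that always guesses ISO-8859-1.
        Function<byte[], Charset> detector = b -> StandardCharsets.ISO_8859_1;

        // Declared UTF-8 matches the bytes, so the detector is never consulted.
        byte[] utf8 = "\u00DCber den Wolken".getBytes(StandardCharsets.UTF_8);
        System.out.println(choose(utf8, StandardCharsets.UTF_8, detector));

        // Declared UTF-8 does not match ISO-8859-1 bytes, so detection runs.
        byte[] latin1 = "\u00DCber".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(choose(latin1, StandardCharsets.UTF_8, detector));
    }
}
```

One caveat with this check: single-byte charsets such as ISO-8859-1 accept any byte sequence, so a clean decode only says something for stricter encodings such as UTF-8. A wrong iso-8859-1 declaration, as in this issue, would not be caught this way.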
                
> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 0.8, 0.9, 0.10
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 1.1
>
>         Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> if the encoding in the meta tag is wrong, this encoding is detected,
> even if the right encoding was set in metadata beforehand (which can come from the http response header).
> test code to reproduce:
> static String content = "<html><head>\n"
> 			+ "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
> 			+ "</head><body>Über den Wolken\n</body></html>";
> 	/**
> 	 * @param args
> 	 * @throws IOException
> 	 * @throws TikaException
> 	 * @throws SAXException
> 	 */
> 	public static void main(String[] args) throws IOException, SAXException,
> 			TikaException {
> 		Metadata metadata = new Metadata();
> 		metadata.set(Metadata.CONTENT_TYPE, "text/html");
> 		metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
> 		System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
> 		InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
> 		AutoDetectParser parser = new AutoDetectParser();
> 		BodyContentHandler h = new BodyContentHandler(10000);
> 		parser.parse(in, h, metadata, new ParseContext());
> 		System.out.print(h.toString());
> 		System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
> 	}
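
For readers who want to see why the wrong meta tag matters in the test above, the following sketch decodes the same UTF-8 bytes both ways using only the JDK. The class MetaMismatchDemo and its method names are hypothetical and not taken from the issue or from Tika.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not part of Tika: shows what happens when UTF-8
// bytes are decoded with the charset named in a wrong meta tag.
final class MetaMismatchDemo {

    // Encode as UTF-8, then decode as ISO-8859-1 (the wrong meta tag value).
    static String decodeAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encode as UTF-8, then decode as UTF-8 (the value set in metadata).
    static String decodeAsUtf8(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00DCber den Wolken";
        // U+00DC is two bytes in UTF-8, so the ISO-8859-1 decoding
        // yields one extra character and a garbled first word.
        System.out.println(decodeAsLatin1(original).length());
        System.out.println(decodeAsUtf8(original).equals(original));
    }
}
```

The text decoded as ISO-8859-1 comes out one character longer than the original because the two UTF-8 bytes of "Ü" are each read as a separate character, which is the corruption the report describes.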

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
