tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1141) javascript files that contain "<html" are detected as text/html
Date Wed, 03 Feb 2016 17:30:40 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130723#comment-15130723 ]

Nick Burch commented on TIKA-1141:

I've tweaked the MIME magic for HTML so that a <html match gets a lower priority when it
isn't near the start of the file. As long as the .js filename is supplied, Tika now correctly
identifies these jQuery files as application/javascript. Without the filename it can't, as we
don't have any JavaScript magic. I'm not sure we could add any either, given the format, but
if someone wants to take a stab at it, that'd be great!
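
For anyone following along, here is a rough sketch of the idea in plain Java. It is a toy
model and not Tika's actual detection code: the class name, the 64-byte "near the start"
window and all three priority numbers are made up for illustration, and only the "<html"
string and the 8192-byte scan limit come from the rule quoted in the issue. The sketch also
only looks at the first 8192 bytes, which approximates the offset range in that rule.

```java
import java.nio.charset.StandardCharsets;

class MagicPrioritySketch {
    // Hypothetical values for illustration; not taken from tika-mimetypes.xml.
    static final int NEAR_START_BYTES = 64;
    static final int HTML_NEAR_START_PRIORITY = 60;
    static final int HTML_ANYWHERE_PRIORITY = 20;
    static final int FILENAME_HINT_PRIORITY = 40;
    // Scan limit from the rule quoted in the issue (offset="0:8192").
    static final int SCAN_LIMIT = 8192;

    /** Priority of the "<html" magic hit, or -1 if it is absent from the scanned prefix. */
    static int htmlMagicPriority(byte[] data) {
        int limit = Math.min(data.length, SCAN_LIMIT);
        String head = new String(data, 0, limit, StandardCharsets.ISO_8859_1);
        int index = head.indexOf("<html");
        if (index < 0) {
            return -1;
        }
        // A hit close to the start is strong evidence; a hit further in is weak.
        return index < NEAR_START_BYTES ? HTML_NEAR_START_PRIORITY : HTML_ANYWHERE_PRIORITY;
    }

    /** Combines the magic hit with an optional filename hint (fileName may be null). */
    static String detect(byte[] data, String fileName) {
        int htmlPriority = htmlMagicPriority(data);
        boolean looksLikeJs = fileName != null && fileName.endsWith(".js");
        // The .js name wins only when the HTML evidence is weak or missing.
        if (looksLikeJs && htmlPriority < FILENAME_HINT_PRIORITY) {
            return "application/javascript";
        }
        if (htmlPriority >= 0) {
            return "text/html";
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append("/* library banner */\n");
        }
        sb.append("var tpl = \"<html><body></body></html>\";");
        byte[] script = sb.toString().getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(detect(script, "jquery.js")); // filename hint available
        System.out.println(detect(script, null));        // no filename, magic only
    }
}
```

With the name supplied, the weak "<html" hit deep in the script loses to the .js hint. With
no name there is nothing to compete with the HTML magic, which is the remaining limitation
described above.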

> javascript files that contain "<html" are detected as text/html
> ---------------------------------------------------------------
>                 Key: TIKA-1141
>                 URL: https://issues.apache.org/jira/browse/TIKA-1141
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.2
>            Reporter: David Hara
>            Priority: Minor
> The Mimetypes detector will return text/html as the mimetype for any javascript file
> that contains the string "<html" in it. I believe this is due to the rule <match value="&lt;html"
> type="string" offset="0:8192"/> in the tika-mimetypes.xml file.

This message was sent by Atlassian JIRA
