tika-dev mailing list archives

From "Gerard Bouchar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2648) mime detection based on resource name detects resources as "text/x-php" instead of "text/html"
Date Wed, 23 May 2018 09:42:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486979#comment-16486979 ]

Gerard Bouchar edited comment on TIKA-2648 at 5/23/18 9:41 AM:
---------------------------------------------------------------

So, would you accept a pull request adding a "nohttp" attribute to glob elements in tika-mimetypes.xml,
for instance?

This would give something like:

{code}
  <mime-type type="text/x-php">
    <_comment>PHP script</_comment>
    <magic priority="50">
      <match value="&lt;?php" type="string" offset="0"/>
    </magic>
    <glob nohttp="true" pattern="*.php"/>
    <glob nohttp="true" pattern="*.php3"/>
    <glob nohttp="true" pattern="*.php4"/>
    <sub-class-of type="text/plain"/>
  </mime-type>
{code}

And in the code, we would not try to match these patterns if the given resource name starts
with "http".
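
A minimal sketch of the check described above, in plain Java and independent of Tika's actual classes. Everything in it is an assumption made for illustration: the GlobRule class, the detectByName method, and the suffix-only handling of "*.ext" patterns are hypothetical and do not reflect how Tika stores or matches globs.

```java
import java.util.Arrays;
import java.util.List;

public class NoHttpGlobSketch {

    /** Hypothetical in-memory form of a <glob> element; not a Tika class. */
    static final class GlobRule {
        final String pattern;   // e.g. "*.php"
        final String mimeType;  // e.g. "text/x-php"
        final boolean noHttp;   // value of the proposed "nohttp" attribute

        GlobRule(String pattern, String mimeType, boolean noHttp) {
            this.pattern = pattern;
            this.mimeType = mimeType;
            this.noHttp = noHttp;
        }
    }

    /** Example rules: the PHP glob is flagged nohttp, the HTML glob is not. */
    static final List<GlobRule> RULES = Arrays.asList(
            new GlobRule("*.php", "text/x-php", true),
            new GlobRule("*.html", "text/html", false));

    /** Returns the type of the first matching rule, or null if no rule applies. */
    static String detectByName(String resourceName, List<GlobRule> rules) {
        // As in the proposal, any name starting with "http" counts as a URL.
        boolean isHttp = resourceName.startsWith("http");
        for (GlobRule rule : rules) {
            if (rule.noHttp && isHttp) {
                continue; // skip globs that must not be applied to URLs
            }
            // Simplification: only "*suffix" patterns are handled here.
            if (rule.pattern.startsWith("*")
                    && resourceName.endsWith(rule.pattern.substring(1))) {
                return rule.mimeType;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detectByName("home.php", RULES));
        System.out.println(detectByName("https://www.facebook.com/home.php", RULES));
    }
}
```

With no name-based match for the URL, a caller could then fall back to the content-type hint ("text/html" in the reported case).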


was (Author: gbouchar):
So, would you accept a pull request adding a "nohttp" attribute to glob elements in tika-mimetypes.xml,
for instance?

> mime detection based on resource name detects resources as "text/x-php" instead of "text/html"
> -----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2648
>                 URL: https://issues.apache.org/jira/browse/TIKA-2648
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> When using Tika to detect a MIME type given only a URL containing ".php" and a content-type
> hint of "text/html", it guesses "text/x-php", whereas one would expect "text/html".
> {code}
> TikaConfig tika = new TikaConfig();
> Metadata metadata = new Metadata();
> String url = "https://www.facebook.com/home.php";
> metadata.set(Metadata.RESOURCE_NAME_KEY, url);
> metadata.set(Metadata.CONTENT_TYPE, "text/html");
> MediaType type = tika.getDetector().detect(null, metadata);
> System.out.println(url + " is of type " + type.toString());
> // Prints https://www.facebook.com/home.php is of type text/x-php
> {code}



