tika-dev mailing list archives

From "Jean Coudon (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1928) Filename detection misses when a # is in a filename
Date Mon, 04 Apr 2016 11:34:25 GMT
Jean Coudon created TIKA-1928:

             Summary: Filename detection misses when a # is in a filename
                 Key: TIKA-1928
                 URL: https://issues.apache.org/jira/browse/TIKA-1928
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 1.12
         Environment: java 8
            Reporter: Jean Coudon
            Priority: Minor

If a filename contains a pound character (#), name-based detection returns application/octet-stream
instead of the type it returns for the same filename without the pound.
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

Metadata metadata = new Metadata();
Tika tika = new Tika();
metadata.add(Metadata.RESOURCE_NAME_KEY, "test#.pdf");
// Tika uses NameDetector if the first parameter is null
System.out.println(tika.detect(null, metadata));
// prints application/octet-stream instead of application/pdf

Tested for application/pdf and application/xml.
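
The report does not state a cause. One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the name is handled like a URL, so everything from the # onward is dropped as a fragment before the extension is examined. The sketch below uses a hypothetical helper, extensionAfterFragmentStrip, which is not part of Tika, to show how that handling would leave "test#.pdf" with no extension to match.

```java
public class FragmentStripSketch {

    // Hypothetical helper, NOT Tika code: mimics URL-style handling that
    // drops everything from the first '#' onward, then reads the extension.
    static String extensionAfterFragmentStrip(String name) {
        int hash = name.indexOf('#');
        if (hash != -1) {
            // Treat the rest of the name as a URL fragment and discard it.
            name = name.substring(0, hash);
        }
        int dot = name.lastIndexOf('.');
        // Return an empty string when no extension is left.
        return dot == -1 ? "" : name.substring(dot + 1);
    }

    public static void main(String[] args) {
        // Without '#', the extension survives.
        System.out.println("[" + extensionAfterFragmentStrip("test.pdf") + "]");  // prints [pdf]
        // With '#', ".pdf" is cut off together with the assumed fragment.
        System.out.println("[" + extensionAfterFragmentStrip("test#.pdf") + "]"); // prints []
    }
}
```

If this assumption holds, a name-only detector has nothing to match against for "test#.pdf", which would be consistent with the application/octet-stream fallback described above.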
