tika-dev mailing list archives

From "Jean Coudon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1928) Filename detection misses when a # is in a filename
Date Mon, 11 Apr 2016 08:17:25 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234676#comment-15234676 ]

Jean Coudon commented on TIKA-1928:

I am using Linux Mint 17.1. I couldn't reproduce it with the CLI app, but the CLI app may not
use the same detection path: it requires a file, whereas my test runs with a null stream.

Yes, my code is extracted from a JUnit test I built to try this out. Here is the full version:


    public void testPoundInFileName() throws IOException {
        org.apache.tika.metadata.Metadata metadata = new org.apache.tika.metadata.Metadata();
        Tika tika = new Tika();
        metadata.add(org.apache.tika.metadata.Metadata.RESOURCE_NAME_KEY, "test#.pdf");
        // tika uses NameDetector if first parameter == null
        assertEquals("application/pdf", tika.detect(null, metadata));
    }
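
One possible explanation for the failing assertion, offered only as a guess: a name-based detector that treats the resource name as a URL might drop everything after a # as a fragment before it looks at the extension. The sketch below is not Tika's code and does not depend on Tika. The class PoundInNameSketch, its two methods, and the two-entry extension table are hypothetical and exist only to show how such stripping would turn "test#.pdf" into "test" and lose the extension.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PoundInNameSketch {
    // Hypothetical extension table; a real detector uses a much larger glob registry.
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("pdf", "application/pdf");
        TYPES.put("xml", "application/xml");
    }

    private static final String FALLBACK = "application/octet-stream";

    // Assumed behaviour: treat everything from '#' onwards as a URL fragment and drop it.
    public static String detectStrippingFragment(String name) {
        int hash = name.indexOf('#');
        if (hash != -1) {
            name = name.substring(0, hash);
        }
        return lookup(name);
    }

    // Alternative: keep '#' as an ordinary filename character.
    public static String detectKeepingPound(String name) {
        return lookup(name);
    }

    // Map the text after the last '.' to a media type, or fall back when there is none.
    private static String lookup(String name) {
        int dot = name.lastIndexOf('.');
        if (dot == -1 || dot == name.length() - 1) {
            return FALLBACK;
        }
        String extension = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        String type = TYPES.get(extension);
        return type != null ? type : FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(detectStrippingFragment("test#.pdf"));
        System.out.println(detectKeepingPound("test#.pdf"));
    }
}
```

If the detector really behaves like the first method, that would match the application/octet-stream result reported in the issue. Whether it does needs to be checked against the NameDetector source.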

> Filename detection misses when a # is in a filename
> ---------------------------------------------------
>                 Key: TIKA-1928
>                 URL: https://issues.apache.org/jira/browse/TIKA-1928
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.12
>         Environment: java 8
>            Reporter: Jean Coudon
>            Priority: Minor
> If a filename contains a pound character (#), it is detected as application/octet-stream
> instead of the type detected for the same filename without the pound.
> {code:java}
> Metadata metadata = new Metadata();
> Tika tika = new Tika();
> metadata.add(Metadata.RESOURCE_NAME_KEY, "test#.pdf");
> // tika uses NameDetector if first parameter == null
> System.out.println(tika.detect(null, metadata));
> // prints application/octet-stream instead of application/pdf
> {code}
> Tested for application/pdf and application/xml.

