tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1164) InputStream get modified by content type detection
Date Fri, 08 Jul 2016 19:33:11 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368275#comment-15368275
] 

Tim Allison commented on TIKA-1164:
-----------------------------------

For anyone stumbling across this issue: it is expected that detection will read bytes from
the underlying stream.  If the underlying stream is not resettable (no mark/reset support), then
calling available() on it after detection will show that bytes are missing.  The key is to keep
using the buffered stream/TikaInputStream, not the underlying stream.

Not great:
{noformat}
        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        File file = new File("testPDFVarious.pdf");
        try (FileInputStream is = new FileInputStream(file)) {
            // tis wraps is; detection reads through tis and then rewinds tis only
            try (InputStream tis = TikaInputStream.get(is)) {
                System.out.println("length: " + file.length());
                System.out.println("avail before: " + tis.available());
                System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
                // the wrapper still reports the full length...
                System.out.println("avail after tis: " + tis.available());
                // ...but the underlying stream has already been partly consumed
                System.out.println("avail after is: " + is.available());
            }
        }
{noformat}
length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491
avail after is: 139955
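
The gap between the two values above (205491 - 139955 = 65536 bytes) is what buffering would produce: the wrapper pulls a block from the underlying stream, lets the detector look at it, and rewinds only itself. The following is a minimal JDK-only sketch of that mechanism, not Tika's actual implementation. The class MarkResetDemo, its toy detect() method, the 100000-byte payload and the 8192-byte buffer are all made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class MarkResetDemo {

    // Hypothetical stand-in for a detector: peek at the first bytes, then rewind.
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[4];
        in.mark(head.length);
        int n;
        try {
            n = in.readNBytes(head, 0, head.length);
        } finally {
            in.reset(); // rewinds the stream we were given, not whatever it wraps
        }
        String magic = new String(head, 0, n, StandardCharsets.US_ASCII);
        return "%PDF".equals(magic) ? "application/pdf" : "application/octet-stream";
    }

    // Returns { wrapper available before, wrapper available after, raw available after }.
    static int[] run() throws IOException {
        byte[] data = new byte[100_000];
        byte[] magic = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(magic, 0, data, 0, magic.length);

        InputStream raw = new ByteArrayInputStream(data);
        InputStream buffered = new BufferedInputStream(raw, 8192);

        int before = buffered.available();
        detect(buffered);
        // The wrapper was reset, so it still sees everything; the raw stream
        // has already handed one buffer's worth of bytes to the wrapper.
        return new int[] { before, buffered.available(), raw.available() };
    }

    public static void main(String[] args) throws IOException {
        int[] r = run();
        System.out.println("avail before: " + r[0]);
        System.out.println("avail after (buffered): " + r[1]);
        System.out.println("avail after (raw): " + r[2]);
    }
}
```

The wrapper reports the same count before and after, while the raw stream reports fewer bytes. Any code that goes back to the raw stream after detection will therefore start reading from the middle of the data.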

Better: call TikaInputStream.get() directly on a file (if you're processing files), and use
only that stream.
{noformat}
        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        File file = new File("testPDFVarious.pdf");
        // no separate underlying stream is exposed, so there is nothing to misuse
        try (InputStream tis = TikaInputStream.get(file)) {
            System.out.println("length: " + file.length());
            System.out.println("avail before: " + tis.available());
            System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
            System.out.println("avail after tis: " + tis.available());
        }
{noformat}
length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491

> InputStream get modified by content type detection
> --------------------------------------------------
>
>                 Key: TIKA-1164
>                 URL: https://issues.apache.org/jira/browse/TIKA-1164
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.4
>         Environment: Windows 7 / Eclipse Kepler / Tomcat 7 / JavaSE 7
>            Reporter: Joël Royer
>            Priority: Blocker
>
> I'm using Tika for content type detection after file upload.
> After tika detection, file content is modified (not the same size compared to original
> uploaded file).
> Here is my code:
> {code}
> AutoDetectParser parser = new AutoDetectParser();
> Detector detector = parser.getDetector();
> Metadata md = new Metadata();
> md.add(Metadata.RESOURCE_NAME_KEY, uploadedFilename);
> md.add(Metadata.CONTENT_TYPE, uploadedFileContentType);
> MediaType type = detector.detect(new BufferedInputStream(is), md);
> {code}
> Before detection, file size is correct.
> After detection, file size is lower than original.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
